Housing prices are an important reflection of the economy, and housing price ranges are of great interest to both buyers and sellers. Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. A house's value is more than location and square footage; like the traits that make up a person, an informed party wants to know every aspect that contributes to it. Suppose, for example, that you want to sell a house and don't know what price to expect: it can't be too low or too high. To estimate it, you would typically look for similar properties in your neighbourhood and use that data to assess your own house's price.
The objective of this problem statement is to predict the housing prices of a town or suburb from the features of the locality provided, and to identify the most important features to consider when predicting prices.
This dataset has 23 features, price being the target variable. The details of all the features are given below:
− cid: a notation for a house
− dayhours: Date house was sold
− price: Price is prediction target (in $)
− room_bed: Number of Bedrooms per house
− room_bath: Number of bathrooms per house
− living_measure: square footage of the home
− lot_measure: square footage of the lot
− ceil: Total floors (levels) in house
− coast: House which has a view to a waterfront (0 - No, 1 - Yes)
− sight: Number of times the house has been viewed
− condition: How good the condition is (Overall out of 5)
− quality: grade given to the housing unit, based on grading system
− ceil_measure: square footage of house apart from basement
− basement: square footage of the basement
− yr_built: Built Year
− yr_renovated: Year when house was renovated
− zipcode: zip code
− lat: Latitude coordinate
− long: Longitude coordinate
− living_measure15: Living room area in 2015 (implies some renovations); this might or might not have affected the lot size
− lot_measure15: lotSize area in 2015 (implies-- some renovations)
− furnished: Whether the house is furnished, based on room quality (0 - No, 1 - Yes)
− total_area: Measure of both living and lot
When buying or selling a house (or any property) in any city, the main problem people face is fixing the right price for that particular property; without guidance, they may offer too little or too much. There are many attributes that can influence the price of a house. The main intent of this project is to determine the right price of a house by considering the various attributes given in the dataset.
# Install any packages that are not already installed
!pip install pandas numpy matplotlib seaborn scikit-learn openpyxl
# Import necessary libraries
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import chi2_contingency
from IPython import get_ipython
from IPython.display import display

from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.impute import KNNImputer
from sklearn.preprocessing import LabelEncoder, StandardScaler, OneHotEncoder
from sklearn.cluster import KMeans
from sklearn import linear_model
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from xgboost import XGBRegressor
# Load the Excel file
file_path = "House Price Prediction.xlsx"
xls = pd.ExcelFile(file_path)
# Load the data from the identified sheet
df_original = pd.read_excel(xls, sheet_name='innercity')
# Copy of original dataframe
df = df_original.copy()
# Display basic information about the dataset
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 21613 entries, 0 to 21612 Data columns (total 23 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 cid 21613 non-null int64 1 dayhours 21613 non-null object 2 price 21613 non-null int64 3 room_bed 21505 non-null float64 4 room_bath 21505 non-null float64 5 living_measure 21596 non-null float64 6 lot_measure 21571 non-null float64 7 ceil 21571 non-null object 8 coast 21612 non-null object 9 sight 21556 non-null float64 10 condition 21556 non-null object 11 quality 21612 non-null float64 12 ceil_measure 21612 non-null float64 13 basement 21612 non-null float64 14 yr_built 21612 non-null object 15 yr_renovated 21613 non-null int64 16 zipcode 21613 non-null int64 17 lat 21613 non-null float64 18 long 21613 non-null object 19 living_measure15 21447 non-null float64 20 lot_measure15 21584 non-null float64 21 furnished 21584 non-null float64 22 total_area 21584 non-null object dtypes: float64(12), int64(4), object(7) memory usage: 3.8+ MB
# Display basic information about the dataset
df.head()
| cid | dayhours | price | room_bed | room_bath | living_measure | lot_measure | ceil | coast | sight | ... | basement | yr_built | yr_renovated | zipcode | lat | long | living_measure15 | lot_measure15 | furnished | total_area | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3876100940 | 20150427T000000 | 600000 | 4.0 | 1.75 | 3050.0 | 9440.0 | 1 | 0 | 0.0 | ... | 1250.0 | 1966 | 0 | 98034 | 47.7228 | -122.183 | 2020.0 | 8660.0 | 0.0 | 12490 |
| 1 | 3145600250 | 20150317T000000 | 190000 | 2.0 | 1.00 | 670.0 | 3101.0 | 1 | 0 | 0.0 | ... | 0.0 | 1948 | 0 | 98118 | 47.5546 | -122.274 | 1660.0 | 4100.0 | 0.0 | 3771 |
| 2 | 7129303070 | 20140820T000000 | 735000 | 4.0 | 2.75 | 3040.0 | 2415.0 | 2 | 1 | 4.0 | ... | 0.0 | 1966 | 0 | 98118 | 47.5188 | -122.256 | 2620.0 | 2433.0 | 0.0 | 5455 |
| 3 | 7338220280 | 20141010T000000 | 257000 | 3.0 | 2.50 | 1740.0 | 3721.0 | 2 | 0 | 0.0 | ... | 0.0 | 2009 | 0 | 98002 | 47.3363 | -122.213 | 2030.0 | 3794.0 | 0.0 | 5461 |
| 4 | 7950300670 | 20150218T000000 | 450000 | 2.0 | 1.00 | 1120.0 | 4590.0 | 1 | 0 | 0.0 | ... | 0.0 | 1924 | 0 | 98118 | 47.5663 | -122.285 | 1120.0 | 5100.0 | 0.0 | 5710 |
5 rows × 23 columns
# Convert 'dayhours' to datetime format
df['dayhours'] = pd.to_datetime(df['dayhours'], format='%Y%m%dT%H%M%S', errors='coerce')
# Convert 'dayhours' to YYYY-MM format
df['dayhours'] = df['dayhours'].dt.strftime('%Y-%m')
# Rename Column 'dayhours' --> 'year_month'
df.rename(columns={'dayhours':'year_month'}, inplace=True)
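The compact timestamp format used in `dayhours` (e.g. `20150427T000000`) can be parsed as above; a minimal sketch on made-up sample values, showing that unparseable entries fall back to missing rather than raising:

```python
import pandas as pd

# Hypothetical samples mimicking the dataset's 'dayhours' format
s = pd.Series(["20150427T000000", "20140820T000000", "not-a-date"])

# Parse with the same format string; unparseable entries become NaT
parsed = pd.to_datetime(s, format="%Y%m%dT%H%M%S", errors="coerce")

# Reduce to a YYYY-MM string, as done for 'year_month'
year_month = parsed.dt.strftime("%Y-%m")
print(year_month.tolist())
```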
# Convert categorical variables to appropriate types
categorical_cols = ['ceil', 'coast', 'condition', 'furnished']
for col in categorical_cols:
    df[col] = df[col].astype('category')

# Convert numerical columns stored as objects to appropriate types
numeric_cols = ['yr_built', 'long', 'total_area']
for col in numeric_cols:
    df[col] = pd.to_numeric(df[col], errors='coerce')
# Function to check null values, data types, number of unique values and duplicated rows.
def get_inform(df):
    check_df = pd.DataFrame()
    check_df['Missing_values'] = df.isnull().sum()
    check_df['data_types'] = df.dtypes
    check_df['unique'] = df.nunique()
    # df.duplicated() is a per-row Series and does not align with the
    # per-column index, so report the total count of duplicated rows instead
    check_df['duplicated'] = df.duplicated().sum()
    return check_df
get_inform(df)
| Missing_values | data_types | unique | duplicated | |
|---|---|---|---|---|
| cid | 0 | int64 | 21436 | NaN |
| year_month | 0 | object | 13 | NaN |
| price | 0 | int64 | 3625 | NaN |
| room_bed | 108 | float64 | 13 | NaN |
| room_bath | 108 | float64 | 30 | NaN |
| living_measure | 17 | float64 | 1038 | NaN |
| lot_measure | 42 | float64 | 9765 | NaN |
| ceil | 42 | category | 7 | NaN |
| coast | 1 | category | 3 | NaN |
| sight | 57 | float64 | 5 | NaN |
| condition | 57 | category | 6 | NaN |
| quality | 1 | float64 | 12 | NaN |
| ceil_measure | 1 | float64 | 946 | NaN |
| basement | 1 | float64 | 306 | NaN |
| yr_built | 15 | float64 | 116 | NaN |
| yr_renovated | 0 | int64 | 70 | NaN |
| zipcode | 0 | int64 | 70 | NaN |
| lat | 0 | float64 | 5034 | NaN |
| long | 34 | float64 | 752 | NaN |
| living_measure15 | 166 | float64 | 774 | NaN |
| lot_measure15 | 29 | float64 | 8682 | NaN |
| furnished | 29 | category | 2 | NaN |
| total_area | 68 | float64 | 11144 | NaN |
# Separate numerical and categorical columns
numerical_cols = df.select_dtypes(include=np.number).columns
categorical_cols = df.select_dtypes(exclude=np.number).columns

# Apply KNN Imputer to numerical columns
imputer = KNNImputer(n_neighbors=5)  # the number of neighbors can be tuned
df[numerical_cols] = imputer.fit_transform(df[numerical_cols])

# Handle missing values in categorical columns (using mode imputation as an example)
for col in categorical_cols:
    df[col] = df[col].fillna(df[col].mode()[0])
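The mode-imputation step can be illustrated on toy data (hypothetical values, not from this dataset); note that `mode()` can return several tied values, which is why `[0]` takes the first:

```python
import pandas as pd

# Hypothetical categorical column with missing entries
s = pd.Series(["a", "b", "b", None, "b", None], dtype="category")

# mode() may contain ties; [0] picks the first (most frequent) value
filled = s.fillna(s.mode()[0])
print(filled.tolist())  # ['a', 'b', 'b', 'b', 'b', 'b']
```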
# Verify missing values are handled
df.isnull().sum()
| 0 | |
|---|---|
| cid | 0 |
| year_month | 0 |
| price | 0 |
| room_bed | 0 |
| room_bath | 0 |
| living_measure | 0 |
| lot_measure | 0 |
| ceil | 0 |
| coast | 0 |
| sight | 0 |
| condition | 0 |
| quality | 0 |
| ceil_measure | 0 |
| basement | 0 |
| yr_built | 0 |
| yr_renovated | 0 |
| zipcode | 0 |
| lat | 0 |
| long | 0 |
| living_measure15 | 0 |
| lot_measure15 | 0 |
| furnished | 0 |
| total_area | 0 |
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 21613 entries, 0 to 21612 Data columns (total 23 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 cid 21613 non-null float64 1 year_month 21613 non-null object 2 price 21613 non-null float64 3 room_bed 21613 non-null float64 4 room_bath 21613 non-null float64 5 living_measure 21613 non-null float64 6 lot_measure 21613 non-null float64 7 ceil 21613 non-null category 8 coast 21613 non-null category 9 sight 21613 non-null float64 10 condition 21613 non-null category 11 quality 21613 non-null float64 12 ceil_measure 21613 non-null float64 13 basement 21613 non-null float64 14 yr_built 21613 non-null float64 15 yr_renovated 21613 non-null float64 16 zipcode 21613 non-null float64 17 lat 21613 non-null float64 18 long 21613 non-null float64 19 living_measure15 21613 non-null float64 20 lot_measure15 21613 non-null float64 21 furnished 21613 non-null category 22 total_area 21613 non-null float64 dtypes: category(4), float64(18), object(1) memory usage: 3.2+ MB
# Description of the Dataset
df.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| cid | 21613.0 | 4.580302e+09 | 2.876566e+09 | 1.000102e+06 | 2.123049e+09 | 3.904930e+09 | 7.308900e+09 | 9.900000e+09 |
| price | 21613.0 | 5.401822e+05 | 3.673622e+05 | 7.500000e+04 | 3.219500e+05 | 4.500000e+05 | 6.450000e+05 | 7.700000e+06 |
| room_bed | 21613.0 | 3.371184e+00 | 9.289046e-01 | 0.000000e+00 | 3.000000e+00 | 3.000000e+00 | 4.000000e+00 | 3.300000e+01 |
| room_bath | 21613.0 | 2.115093e+00 | 7.690269e-01 | 0.000000e+00 | 1.750000e+00 | 2.250000e+00 | 2.500000e+00 | 8.000000e+00 |
| living_measure | 21613.0 | 2.079818e+03 | 9.182427e+02 | 2.900000e+02 | 1.430000e+03 | 1.910000e+03 | 2.550000e+03 | 1.354000e+04 |
| lot_measure | 21613.0 | 1.509973e+04 | 4.138666e+04 | 5.200000e+02 | 5.040000e+03 | 7.620000e+03 | 1.068800e+04 | 1.651359e+06 |
| sight | 21613.0 | 2.341276e-01 | 7.657419e-01 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 4.000000e+00 |
| quality | 21613.0 | 7.656855e+00 | 1.175457e+00 | 1.000000e+00 | 7.000000e+00 | 7.000000e+00 | 8.000000e+00 | 1.300000e+01 |
| ceil_measure | 21613.0 | 1.788368e+03 | 8.280834e+02 | 2.900000e+02 | 1.190000e+03 | 1.560000e+03 | 2.210000e+03 | 9.410000e+03 |
| basement | 21613.0 | 2.915173e+02 | 4.425713e+02 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 5.600000e+02 | 4.820000e+03 |
| yr_built | 21613.0 | 1.971005e+03 | 2.936936e+01 | 1.900000e+03 | 1.951000e+03 | 1.975000e+03 | 1.997000e+03 | 2.015000e+03 |
| yr_renovated | 21613.0 | 8.440226e+01 | 4.016792e+02 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 2.015000e+03 |
| zipcode | 21613.0 | 9.807794e+04 | 5.350503e+01 | 9.800100e+04 | 9.803300e+04 | 9.806500e+04 | 9.811800e+04 | 9.819900e+04 |
| lat | 21613.0 | 4.756005e+01 | 1.385637e-01 | 4.715590e+01 | 4.747100e+01 | 4.757180e+01 | 4.767800e+01 | 4.777760e+01 |
| long | 21613.0 | -1.222139e+02 | 1.408197e-01 | -1.225190e+02 | -1.223280e+02 | -1.222300e+02 | -1.221250e+02 | -1.213150e+02 |
| living_measure15 | 21613.0 | 1.986691e+03 | 6.847548e+02 | 3.990000e+02 | 1.490000e+03 | 1.840000e+03 | 2.360000e+03 | 6.210000e+03 |
| lot_measure15 | 21613.0 | 1.276203e+04 | 2.727111e+04 | 6.510000e+02 | 5.100000e+03 | 7.620000e+03 | 1.008700e+04 | 8.712000e+05 |
| total_area | 21613.0 | 1.717963e+04 | 4.156833e+04 | 1.423000e+03 | 7.035000e+03 | 9.580000e+03 | 1.300000e+04 | 1.652659e+06 |
# Extract only the year of sale from 'year_month'
df['sold_year'] = df['year_month'].str[:4].astype(int)

# Ensure 'sold_year' and 'yr_built' are numeric (non-numeric values become NaN)
df["sold_year"] = pd.to_numeric(df["sold_year"], errors="coerce")
df["yr_built"] = pd.to_numeric(df["yr_built"], errors="coerce")

# Calculate the 'house_age' feature
df["house_age"] = df["sold_year"] - df["yr_built"] + 1
# Convert specified columns to numeric, then round and cast to nullable Int64
columns_to_convert = ['price', 'house_age', 'yr_built', 'living_measure', 'living_measure15',
                      'lot_measure', 'lot_measure15', 'ceil', 'coast', 'sight', 'condition',
                      'furnished', 'total_area', 'room_bed', 'room_bath', 'ceil_measure',
                      'basement', 'quality', 'yr_renovated', 'long', 'lat', 'zipcode', 'sold_year']
for col in columns_to_convert:
    # Convert to numeric, coercing errors to NaN
    df[col] = pd.to_numeric(df[col], errors='coerce')
    # Round to the nearest integer
    df[col] = df[col].round(0).astype('Int64')
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 21613 entries, 0 to 21612 Data columns (total 25 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 cid 21613 non-null float64 1 year_month 21613 non-null object 2 price 21613 non-null Int64 3 room_bed 21613 non-null Int64 4 room_bath 21613 non-null Int64 5 living_measure 21613 non-null Int64 6 lot_measure 21613 non-null Int64 7 ceil 21583 non-null Int64 8 coast 21583 non-null Int64 9 sight 21613 non-null Int64 10 condition 21585 non-null Int64 11 quality 21613 non-null Int64 12 ceil_measure 21613 non-null Int64 13 basement 21613 non-null Int64 14 yr_built 21613 non-null Int64 15 yr_renovated 21613 non-null Int64 16 zipcode 21613 non-null Int64 17 lat 21613 non-null Int64 18 long 21613 non-null Int64 19 living_measure15 21613 non-null Int64 20 lot_measure15 21613 non-null Int64 21 furnished 21613 non-null Int64 22 total_area 21613 non-null Int64 23 sold_year 21613 non-null Int64 24 house_age 21613 non-null Int64 dtypes: Int64(23), float64(1), object(1) memory usage: 4.6+ MB
After coercing some features to numeric, NaN values appear where the original entries could not be parsed. We therefore apply the KNN Imputer again to the affected columns.
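A minimal sketch (on made-up numbers, not this dataset) of how `KNNImputer` fills a NaN from the mean of the same feature in the nearest-neighbor rows:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix: one missing value in the second column
X = np.array([
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, np.nan],
    [7.0, 8.0],
])

# With n_neighbors=2, the NaN is replaced by the mean of the
# second-column values of the two rows closest in the first column:
# neighbors [3, 4] and [7, 8], so the fill value is (4 + 8) / 2 = 6
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled[2, 1])  # 6.0
```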
# Separate numerical columns
numerical_cols = df.select_dtypes(include=np.number).columns

# Apply KNN Imputer to numerical columns with NaNs
imputer = KNNImputer(n_neighbors=5)  # the number of neighbors can be tuned
df[numerical_cols] = imputer.fit_transform(df[numerical_cols])

# Convert specified columns to int64
for col in columns_to_convert:
    df[col] = df[col].astype('int64')
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 21613 entries, 0 to 21612 Data columns (total 25 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 cid 21613 non-null float64 1 year_month 21613 non-null object 2 price 21613 non-null int64 3 room_bed 21613 non-null int64 4 room_bath 21613 non-null int64 5 living_measure 21613 non-null int64 6 lot_measure 21613 non-null int64 7 ceil 21613 non-null int64 8 coast 21613 non-null int64 9 sight 21613 non-null int64 10 condition 21613 non-null int64 11 quality 21613 non-null int64 12 ceil_measure 21613 non-null int64 13 basement 21613 non-null int64 14 yr_built 21613 non-null int64 15 yr_renovated 21613 non-null int64 16 zipcode 21613 non-null int64 17 lat 21613 non-null int64 18 long 21613 non-null int64 19 living_measure15 21613 non-null int64 20 lot_measure15 21613 non-null int64 21 furnished 21613 non-null int64 22 total_area 21613 non-null int64 23 sold_year 21613 non-null int64 24 house_age 21613 non-null int64 dtypes: float64(1), int64(23), object(1) memory usage: 4.1+ MB
# Verify missing values are handled
df.isnull().sum()
| 0 | |
|---|---|
| cid | 0 |
| year_month | 0 |
| price | 0 |
| room_bed | 0 |
| room_bath | 0 |
| living_measure | 0 |
| lot_measure | 0 |
| ceil | 0 |
| coast | 0 |
| sight | 0 |
| condition | 0 |
| quality | 0 |
| ceil_measure | 0 |
| basement | 0 |
| yr_built | 0 |
| yr_renovated | 0 |
| zipcode | 0 |
| lat | 0 |
| long | 0 |
| living_measure15 | 0 |
| lot_measure15 | 0 |
| furnished | 0 |
| total_area | 0 |
| sold_year | 0 |
| house_age | 0 |
# Calculate the unique values of the categorical variables: 'room_bed', 'room_bath', 'ceil', 'coast', 'quality', 'condition', 'sight', 'furnished'
unique_room_bed = sorted(df['room_bed'].unique())
unique_room_bath = sorted(df['room_bath'].unique())
unique_ceil = sorted(df['ceil'].unique())
unique_coast = sorted(df['coast'].unique())
unique_quality = sorted(df['quality'].unique())
unique_condition = sorted(df['condition'].unique())
unique_sight = sorted(df['sight'].unique())
unique_furnished = sorted(df['furnished'].unique())
print("Unique room_bed values (sorted):", unique_room_bed)
print("Unique room_bath values (sorted):", unique_room_bath)
print("Unique ceil values (sorted):", unique_ceil)
print("Unique coast values (sorted):", unique_coast)
print("Unique quality values (sorted):", unique_quality)
print("Unique condition values (sorted):", unique_condition)
print("Unique sight values (sorted):", unique_sight)
print("Unique furnished values (sorted):", unique_furnished)
Unique room_bed values (sorted): [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 33] Unique room_bath values (sorted): [0, 1, 2, 3, 4, 5, 6, 7, 8] Unique ceil values (sorted): [1, 2, 3, 4] Unique coast values (sorted): [0, 1] Unique quality values (sorted): [1, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13] Unique condition values (sorted): [1, 2, 3, 4, 5] Unique sight values (sorted): [0, 1, 2, 3, 4] Unique furnished values (sorted): [0, 1]
# Convert categorical variables to appropriate types
cat_cols = ['room_bed', 'room_bath', 'ceil', 'coast', 'quality', 'condition', 'sight', 'furnished']
for col in cat_cols:
    df[col] = df[col].astype('category')
def analyze_categorical_features(df, cat_cols):
    """
    Calculates the frequency of each category in the specified columns
    and shows the corresponding mean of the price.

    Args:
        df: The input DataFrame.
        cat_cols: A list of categorical column names.
    """
    for col in cat_cols:
        print(f"Frequency and Mean Price for {col}:")

        # Calculate frequency and mean price
        freq_df = df.groupby(col)['price'].agg(['count', 'mean']).reset_index()
        print(freq_df)

        # Create the plot
        fig, ax1 = plt.subplots(figsize=(10, 6))

        # Bar plot for frequency on the primary y-axis
        ax1.bar(freq_df[col], freq_df['count'], color='lightblue', label='Frequency')
        ax1.set_xlabel(col)
        ax1.set_ylabel('Frequency', color='lightblue')
        ax1.tick_params(axis='y', labelcolor='lightblue')

        # Line plot for mean price on the secondary y-axis
        ax2 = ax1.twinx()
        ax2.plot(freq_df[col], freq_df['mean'], color='green', marker='o', label='Mean Price')
        ax2.set_ylabel('Mean Price', color='green')
        ax2.tick_params(axis='y', labelcolor='green')

        # Add title and legend
        plt.title(f'Frequency and Mean Price vs. {col}')
        fig.legend(loc="upper left", bbox_to_anchor=(0, 1), bbox_transform=ax1.transAxes)
        plt.show()
        print("-" * 20)
## Calculate the frequency of each category in the specified columns and show the corresponding mean of the price
analyze_categorical_features(df, cat_cols)
Frequency and Mean Price for room_bed: room_bed count mean 0 0 13 4.102231e+05 1 1 197 3.189286e+05 2 2 2755 4.014207e+05 3 3 9821 4.661674e+05 4 4 6897 6.354905e+05 5 5 1598 7.864493e+05 6 6 270 8.274895e+05 7 7 38 9.514478e+05 8 8 13 1.105077e+06 9 9 6 8.939998e+05 10 10 3 8.200000e+05 11 11 1 5.200000e+05 12 33 1 6.400000e+05
-------------------- Frequency and Mean Price for room_bath: room_bath count mean 0 0 14 3.886036e+05 1 1 3920 3.465683e+05 2 2 13868 5.005330e+05 3 3 2530 7.474795e+05 4 4 1194 1.078614e+06 5 5 57 1.847437e+06 6 6 24 2.603958e+06 7 7 2 2.735000e+06 8 8 4 4.330000e+06
-------------------- Frequency and Mean Price for ceil: ceil count mean 0 1 10711 442630.375875 1 2 10284 638930.905290 2 3 610 583124.773770 3 4 8 933937.500000
-------------------- Frequency and Mean Price for coast: coast count mean 0 0 21452 5.317155e+05 1 1 161 1.668301e+06
-------------------- Frequency and Mean Price for quality: quality count mean 0 1 1 1.420000e+05 1 3 3 2.056667e+05 2 4 29 2.143810e+05 3 5 242 2.485240e+05 4 6 2038 3.019166e+05 5 7 8981 4.025933e+05 6 8 6068 5.428955e+05 7 9 2615 7.737382e+05 8 10 1134 1.072347e+06 9 11 399 1.497792e+06 10 12 90 2.192500e+06 11 13 13 3.710769e+06
-------------------- Frequency and Mean Price for condition: condition count mean 0 1 30 334431.666667 1 2 172 326269.122093 2 3 14056 542337.691662 3 4 5661 520774.761526 4 5 1694 612515.489965
-------------------- Frequency and Mean Price for sight: sight count mean 0 0 19489 4.966133e+05 1 1 335 8.174811e+05 2 2 961 7.911354e+05 3 3 510 9.724684e+05 4 4 318 1.466554e+06
-------------------- Frequency and Mean Price for furnished: furnished count mean 0 0 17367 437403.973628 1 1 4246 960565.753179
--------------------
# Replacing room_bed values with low frequency with room_bed values that are closer in mean_price
df.room_bed = df.room_bed.replace({0: 2, 7: 6, 8: 6, 9: 6, 10: 6, 11: 3, 33: 4})
# Replacing room_bath values with low frequency with room_bath values that are closer in mean_price
df.room_bath = df.room_bath.replace({0: 1, 4: 3, 5: 3, 6: 3, 7: 3, 8: 2})
# Replacing ceil values with low frequency with ceil values that are closer in mean_price
df.ceil = df.ceil.replace({4: 2})
# Replacing sight values with low frequency with sight values that are closer in mean_price
df.sight = df.sight.replace({4: 3})
# Replacing quality values with low frequency with quality values that are closer in mean_price
df.quality = df.quality.replace({1: 6, 3: 6, 4: 6, 5: 6, 10: 6, 11: 6, 12: 6, 13: 6})
# Replacing condition values with low frequency with condition values that are closer in mean_price
df.condition = df.condition.replace({1: 2})
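The mappings above were chosen by hand from the frequency tables. The same idea can be sketched generically on hypothetical toy data: fold categories below a count threshold into the nearest frequent level (here "nearest" is by numeric distance for simplicity; the notebook instead uses closeness in mean price):

```python
import pandas as pd

# Hypothetical category counts with two rare levels (7 and 33)
s = pd.Series([3] * 50 + [4] * 30 + [7] * 2 + [33] * 1)

min_count = 5  # categories rarer than this get merged
counts = s.value_counts()
rare = counts[counts < min_count].index
frequent = counts[counts >= min_count].index

# Map each rare level to the closest frequent level
mapping = {r: min(frequent, key=lambda f: abs(f - r)) for r in rare}
merged = s.replace(mapping)
print(sorted(merged.unique()))  # [3, 4]
```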
## Calculate the frequency of each category in the specified columns and show the corresponding mean of the price
analyze_categorical_features(df, cat_cols)
Frequency and Mean Price for room_bed: room_bed count mean 0 1 197 318928.598985 1 2 2768 401462.078035 2 3 9822 466172.854714 3 4 6898 635491.172514 4 5 1598 786449.296621 5 6 330 853839.906061
-------------------- Frequency and Mean Price for room_bath: room_bath count mean 0 1 3934 346717.852059 1 2 13872 501637.196583 2 3 3807 880551.031521
-------------------- Frequency and Mean Price for ceil: ceil count mean 0 1 10711 442630.375875 1 2 10292 639160.214730 2 3 610 583124.773770
-------------------- Frequency and Mean Price for coast: coast count mean 0 0 21452 5.317155e+05 1 1 161 1.668301e+06
-------------------- Frequency and Mean Price for quality: quality count mean 0 6 3949 694264.626234 1 7 8981 402593.324351 2 8 6068 542895.499670 3 9 2615 773738.222180
-------------------- Frequency and Mean Price for condition: condition count mean 0 2 202 327481.381188 1 3 14056 542337.691662 2 4 5661 520774.761526 3 5 1694 612515.489965
-------------------- Frequency and Mean Price for sight: sight count mean 0 0 19489 4.966133e+05 1 1 335 8.174811e+05 2 2 961 7.911354e+05 3 3 828 1.162226e+06
-------------------- Frequency and Mean Price for furnished: furnished count mean 0 0 17367 437403.973628 1 1 4246 960565.753179
--------------------
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 21613 entries, 0 to 21612 Data columns (total 25 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 cid 21613 non-null float64 1 year_month 21613 non-null object 2 price 21613 non-null int64 3 room_bed 21613 non-null category 4 room_bath 21613 non-null category 5 living_measure 21613 non-null int64 6 lot_measure 21613 non-null int64 7 ceil 21613 non-null category 8 coast 21613 non-null category 9 sight 21613 non-null category 10 condition 21613 non-null category 11 quality 21613 non-null category 12 ceil_measure 21613 non-null int64 13 basement 21613 non-null int64 14 yr_built 21613 non-null int64 15 yr_renovated 21613 non-null int64 16 zipcode 21613 non-null int64 17 lat 21613 non-null int64 18 long 21613 non-null int64 19 living_measure15 21613 non-null int64 20 lot_measure15 21613 non-null int64 21 furnished 21613 non-null category 22 total_area 21613 non-null int64 23 sold_year 21613 non-null int64 24 house_age 21613 non-null int64 dtypes: category(8), float64(1), int64(15), object(1) memory usage: 3.0+ MB
# Detecting outliers using boxplots for key numerical features
outlier_features = ['price', 'living_measure', 'living_measure15', 'lot_measure', 'lot_measure15', 'total_area', 'ceil_measure', 'basement']
for feature in outlier_features:
    plt.figure(figsize=(8, 4))
    sns.boxplot(x=df[feature])
    plt.title(f'Boxplot of {feature} (Outlier Detection)')
    plt.xlabel(feature)
    plt.show()
# Verify the new distributions after the log1p transform
for feature in outlier_features:
    # Apply log1p transformation to mitigate outliers
    df[feature + '_log1p'] = np.log1p(df[feature])
    plt.figure(figsize=(8, 4))
    sns.boxplot(x=df[feature + '_log1p'])
    plt.title(f'Boxplot of {feature}_log1p After Outlier Treatment')
    plt.xlabel(feature + '_log1p')
    plt.show()

# Drop the original outlier features
df = df.drop(columns=outlier_features)
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 21613 entries, 0 to 21612 Data columns (total 25 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 cid 21613 non-null float64 1 year_month 21613 non-null object 2 room_bed 21613 non-null category 3 room_bath 21613 non-null category 4 ceil 21613 non-null category 5 coast 21613 non-null category 6 sight 21613 non-null category 7 condition 21613 non-null category 8 quality 21613 non-null category 9 yr_built 21613 non-null int64 10 yr_renovated 21613 non-null int64 11 zipcode 21613 non-null int64 12 lat 21613 non-null int64 13 long 21613 non-null int64 14 furnished 21613 non-null category 15 sold_year 21613 non-null int64 16 house_age 21613 non-null int64 17 price_log1p 21613 non-null float64 18 living_measure_log1p 21613 non-null float64 19 living_measure15_log1p 21613 non-null float64 20 lot_measure_log1p 21613 non-null float64 21 lot_measure15_log1p 21613 non-null float64 22 total_area_log1p 21613 non-null float64 23 ceil_measure_log1p 21613 non-null float64 24 basement_log1p 21613 non-null float64 dtypes: category(8), float64(9), int64(7), object(1) memory usage: 3.0+ MB
Mitigating Outliers in Numerical Data with Log1p Transformation
Using the log1p transformation is a common technique to handle outliers in numerical features. This transformation applies the natural logarithm to the feature values after adding 1, which helps to reduce the impact of extreme values.
This transformation can help make the data more normally distributed, which is beneficial for many machine learning algorithms.
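A quick sketch (on made-up prices, not values from this dataset) showing the compression effect and how `np.expm1` exactly inverts the transform to recover dollar values:

```python
import numpy as np

# Hypothetical prices with one extreme outlier
prices = np.array([200_000.0, 450_000.0, 7_700_000.0])

log_prices = np.log1p(prices)      # log(1 + x), safe even at x = 0
recovered = np.expm1(log_prices)   # exact inverse of log1p

# The outlier is ~38x the smallest price in dollars,
# but only ~1.3x after the transform
print(log_prices.max() / log_prices.min())
print(np.allclose(recovered, prices))  # True
```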
# Number of Houses Sold Over Time
print(df['year_month'].value_counts().sort_index())
year_month 2014-05 1768 2014-06 2180 2014-07 2211 2014-08 1940 2014-09 1774 2014-10 1878 2014-11 1411 2014-12 1471 2015-01 978 2015-02 1250 2015-03 1875 2015-04 2231 2015-05 646 Name: count, dtype: int64
# Calculate the aggregated mean price per month
mean_price_log1p_per_month = df.groupby(['year_month'])['price_log1p'].agg('mean')
print(mean_price_log1p_per_month)
year_month 2014-05 13.061634 2014-06 13.079807 2014-07 13.059797 2014-08 13.044293 2014-09 13.034473 2014-10 13.038289 2014-11 13.011228 2014-12 13.017619 2015-01 13.008521 2015-02 12.992569 2015-03 13.051457 2015-04 13.096213 2015-05 13.075737 Name: price_log1p, dtype: float64
# Create the plot
plt.figure(figsize=(12, 6))
df['year_month'].value_counts().sort_index().plot(kind='line', marker='o')
plt.xlabel('Year-Month')
plt.ylabel('Number of Houses Sold')
plt.title('Number of Houses Sold Over Time')
plt.grid(True)
plt.xticks(rotation=45, ha='right') # Rotate x-axis labels for better readability
plt.tight_layout()
plt.show()
Plot 1: Number of Houses Sold Over Time
# Calculate the aggregated mean price per month
mean_price_log1p_per_month = df.groupby(['year_month'])['price_log1p'].agg('mean')

# Create the plot with two y-axes
fig, ax1 = plt.subplots(figsize=(12, 6))

# Plot the number of houses sold
houses_sold = df['year_month'].value_counts().sort_index()
ax1.plot(houses_sold.index, houses_sold.values,
         marker='o', color='blue', label='Number of Houses Sold')
ax1.set_xlabel('Year-Month')
ax1.set_ylabel('Number of Houses Sold', color='blue')
ax1.tick_params(axis='y', labelcolor='blue')
plt.xticks(rotation=45, ha='right')

# Create a second y-axis
ax2 = ax1.twinx()
ax2.plot(mean_price_log1p_per_month.index, mean_price_log1p_per_month.values,
         marker='o', color='red', label='Mean Price_log1p per Month')
ax2.set_ylabel('Mean Price_log1p per Month', color='red')
ax2.tick_params(axis='y', labelcolor='red')

# Add title and legend
plt.title('Number of Houses Sold and Mean Price_log1p per Month Over Time')
fig.legend(loc="upper left", bbox_to_anchor=(0, 1), bbox_transform=ax1.transAxes)
plt.grid(True)
plt.tight_layout()
plt.show()
# Find the month with the highest sales:
max_sales_month = df['year_month'].value_counts().idxmax()
print(f"Month with highest sales: {max_sales_month}")
# Find the month with the lowest sales:
min_sales_month = df['year_month'].value_counts().idxmin()
print(f"Month with lowest sales: {min_sales_month}")
# Calculate the correlation between sales volume and average price
sales_volume = df['year_month'].value_counts().sort_index().values
mean_price = df.groupby('year_month')['price_log1p'].mean().values
correlation = np.corrcoef(sales_volume, mean_price)[0,1]
print(f"Correlation between sales volume and average price_log1p: {correlation}")
Month with highest sales: 2015-04
Month with lowest sales: 2015-05
Correlation between sales volume and average price_log1p: 0.4685189596268225
Plot 2: Number of Houses Sold and Mean Price per Month
Sales Trend: Sales peak in early summer 2014 (June-July, about 2,200 per month) and again in April 2015 (2,231), with a winter trough in January 2015 (978). The sharp drop in May 2015 (646) most likely reflects a partial final month of data rather than a real decline.
Price_log1p Trend: The monthly mean price_log1p stays in a narrow band (12.99-13.10), dipping through the winter months and recovering by spring 2015, broadly tracking sales volume (correlation 0.47).
Potential further investigations: seasonality of demand, whether the May 2015 records cover the full month, and external market factors behind the winter slowdown.
# Calculate the aggregated mean price per month
df['avg_price_log1p_zipcode'] = df.groupby(['zipcode'])['price_log1p'].transform('mean')
print(df[['zipcode','price_log1p','avg_price_log1p_zipcode']].head())
  zipcode  price_log1p  avg_price_log1p_zipcode
0   98034    13.304687                13.076286
1   98118    12.154785                12.841957
2   98118    13.507627                12.841957
3   98002    12.456835                12.337707
4   98118    13.017005                12.841957
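A quick aside on the two groupby patterns used in this notebook: `.agg('mean')` returns one row per group, while `.transform('mean')` broadcasts the group mean back onto every original row, which is what lets the result be stored as a new column. A minimal sketch on a tiny stand-in frame:

```python
# .agg vs .transform: same statistic, different output shape.
import pandas as pd

demo = pd.DataFrame({'zipcode': ['98034', '98118', '98118'],
                     'price_log1p': [13.0, 12.0, 13.5]})

agg_result = demo.groupby('zipcode')['price_log1p'].agg('mean')         # length 2: one mean per zipcode
transformed = demo.groupby('zipcode')['price_log1p'].transform('mean')  # length 3: aligned to demo's index
print(len(agg_result), len(transformed))
print(transformed.tolist())  # [13.0, 12.75, 12.75]
```

This is why `avg_price_log1p_zipcode` above has one value per house while `mean_price_log1p_per_month` has one value per month.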
# Convert 'zipcode' to integer first, then to categorical
df['zipcode'] = df['zipcode'].astype(int)
df['zipcode'] = df['zipcode'].astype('category')
# Plotting 'avg_price_log1p_zipcode' per 'zipcode'
plt.figure(figsize=(10, 6))
sns.lineplot(x='zipcode', y='avg_price_log1p_zipcode', data=df)
plt.xlabel('Zipcode')
plt.ylabel('Average Log1p Price')
plt.title('Average Log1p Price per Zipcode')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()
Insights and conclusions:
The plot shows the relationship between zip code and the average log-transformed price.
There are variations in average prices across different zip codes. Some zip codes show higher average prices, while others have lower average prices.
Next step is to identify any clusters or patterns in the data. Are there groups of zip codes with similar average prices?
# Prepare the data for clustering
# Select relevant features for clustering
features_for_clustering = ['avg_price_log1p_zipcode', 'lat', 'long']
X = df[features_for_clustering]
# Standardize the features (important for KMeans)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Determine the optimal number of clusters (e.g., using the Elbow method)
inertia = []
for k in range(1, 11):  # Test k from 1 to 10
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X_scaled)
    inertia.append(kmeans.inertia_)
plt.figure(figsize=(8, 6))
plt.plot(range(1, 11), inertia, marker='o')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal k')
plt.show()
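When the elbow is ambiguous, silhouette scores offer a complementary check: they measure how well-separated the resulting clusters are, and the k with the highest score is a reasonable pick. A sketch using stand-in two-blob data rather than the notebook's `X_scaled`:

```python
# Complementary model-selection check: pick k by maximum silhouette score.
# Synthetic two-blob data stands in for the notebook's X_scaled.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)
X_scaled = np.vstack([rng.normal(0, 1, (150, 3)),
                      rng.normal(4, 1, (150, 3))])

scores = {}
for k in range(2, 7):  # silhouette is undefined for k=1
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # the two-blob data should favour k=2
```

On the real features the chosen k can then be cross-checked against the elbow plot before committing to `optimal_k` below.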
# Based on the Elbow method plot, choose an appropriate 'k'
optimal_k = 3 # example, replace with your chosen k
# Perform KMeans clustering
kmeans = KMeans(n_clusters=optimal_k, random_state=42)
df['cluster'] = kmeans.fit_predict(X_scaled)
# Plot the clusters on a map (using latitude and longitude)
plt.figure(figsize=(8, 6))
# Calculate the average latitude and longitude for each cluster
cluster_centers = df.groupby('cluster')[['lat', 'long']].mean()
# Plot the clusters
for cluster in cluster_centers.index:
    plt.scatter(cluster_centers.loc[cluster, 'long'], cluster_centers.loc[cluster, 'lat'],
                label=f'Cluster {cluster}')
plt.xlabel('Average Longitude')
plt.ylabel('Average Latitude')
plt.title('Clusters of Zipcodes based on Price and Location')
plt.legend()
plt.show()
# More straightforward representation: Boxplot of Average Price by Cluster
plt.figure(figsize=(8, 6))
sns.boxplot(x='cluster', y='avg_price_log1p_zipcode', data=df, palette='viridis')
plt.xlabel('Cluster')
plt.ylabel('Average Price (Log1p)')
plt.title('Average Price (Log1p) per Cluster')
plt.show()
# Interpretation of Clusters:
for cluster in range(optimal_k):
    cluster_data = df[df['cluster'] == cluster]
    print(f"Cluster {cluster}:")
    print("Average Price (log1p):", cluster_data['avg_price_log1p_zipcode'].mean())
    print("Average Latitude:", cluster_data['lat'].mean())
    print("Average Longitude:", cluster_data['long'].mean())
    print("Other relevant statistics...")  # add more analysis here
    print("-" * 20)
Cluster 0:
Average Price (log1p): 12.652643982988419
Average Latitude: 47.0
Average Longitude: -122.00272479564033
Other relevant statistics...
--------------------
Cluster 1:
Average Price (log1p): 13.060406177474718
Average Latitude: 47.99990069513406
Average Longitude: -121.99831181727905
Other relevant statistics...
--------------------
Cluster 2:
Average Price (log1p): 13.551217098121123
Average Latitude: 48.0
Average Longitude: -122.0
Other relevant statistics...
--------------------
# Analyze Cluster characteristics in terms of Price, House Size, House Age, Condition
for cluster in range(optimal_k):
    cluster_data = df[df['cluster'] == cluster]
    print(f"Cluster {cluster}:")
    print("Median Price:", cluster_data['avg_price_log1p_zipcode'].median())
    print("Median House Size (living_measure):", cluster_data['living_measure_log1p'].median())
    print("Median House Age:", cluster_data['house_age'].median())
    # Use mode for 'condition' (most frequent category) instead of median
    print("Mode Condition:", cluster_data['condition'].mode()[0])
    # Create visualizations
    plt.figure(figsize=(10, 6))
    plt.subplot(2, 2, 1)
    sns.histplot(cluster_data['avg_price_log1p_zipcode'], kde=True)
    plt.title('Price Distribution')
    plt.subplot(2, 2, 2)
    sns.boxplot(y=cluster_data['living_measure_log1p'])
    plt.title('House Size Distribution')
    plt.subplot(2, 2, 3)
    sns.histplot(cluster_data['house_age'], kde=True)
    plt.title('House Age Distribution')
    plt.subplot(2, 2, 4)
    sns.countplot(x=cluster_data['condition'])
    plt.title('Condition Distribution')
    plt.tight_layout()
    plt.show()
    print("-" * 20)
Cluster 0:
Median Price: 12.598429550345802
Median House Size (living_measure): 7.523481312573497
Median House Age: 37.0
Mode Condition: 3
--------------------
Cluster 1:
Median Price: 13.123764050672976
Median House Size (living_measure): 7.473637108496206
Median House Age: 53.0
Mode Condition: 3
--------------------
Cluster 2:
Median Price: 13.536099467939726
Median House Size (living_measure): 7.80016307039296
Median House Age: 36.0
Mode Condition: 3
--------------------
Comparative Analysis:
These characteristics suggest that Cluster 2 represents larger, newer houses with the highest prices, while Cluster 1 represents older houses (median age 53 years) with mid-range prices. Cluster 0 has the lowest prices and a house size between the other two, with a median age similar to Cluster 2. House condition is consistent (mode 3) across all clusters.
# Plot distributions of key numerical features
num_features = ['price_log1p', 'living_measure_log1p', 'living_measure15_log1p', 'lot_measure_log1p', 'lot_measure15_log1p', 'total_area_log1p', 'ceil_measure_log1p', 'basement_log1p', 'yr_renovated', 'yr_built', 'house_age']
for feature in num_features:
    plt.figure(figsize=(8, 4))
    sns.histplot(df[feature], bins=30, kde=True)
    plt.title(f'Distribution of {feature}')
    plt.xlabel(feature)
    plt.ylabel('Frequency')
    plt.show()
# Analyze numerical features' distributions
num_features = ['price_log1p', 'living_measure_log1p', 'living_measure15_log1p', 'lot_measure_log1p', 'lot_measure15_log1p', 'total_area_log1p', 'ceil_measure_log1p', 'basement_log1p', 'yr_renovated', 'yr_built', 'house_age']
for feature in num_features:
    print(f"Analysis of {feature}:")
    print(f" - Mean: {df[feature].mean()}")
    print(f" - Median: {df[feature].median()}")
    print(f" - Standard Deviation: {df[feature].std()}")
    print(f" - Minimum: {df[feature].min()}")
    print(f" - Maximum: {df[feature].max()}")
    print(f" - Skewness: {df[feature].skew()}")  # Measure of asymmetry
    print("-" * 20)
Feature                    Mean       Median     Std Dev    Min        Max        Skewness
price_log1p               13.0479    13.0170     0.5268    11.2253    15.8567     0.4290
living_measure_log1p       7.5509     7.5554     0.4244     5.6733     9.5135    -0.0342
living_measure15_log1p     7.5402     7.5181     0.3270     5.9915     8.7341     0.2070
lot_measure_log1p          8.9903     8.9387     0.9020     6.2558    14.3171     0.9617
lot_measure15_log1p        8.9612     8.9387     0.8127     6.4800    13.6776     0.9654
total_area_log1p           9.2537     9.1675     0.7734     7.2612    14.3179     1.4585
ceil_measure_log1p         7.3955     7.3531     0.4274     5.6733     9.1496     0.2541
basement_log1p             2.5295     0.0000     3.1706     0.0000     8.4807     0.4844
yr_renovated              84.4023     0.0      401.6792     0         2015        4.5495
yr_built                1971.0052  1975.0       29.3693  1900         2015       -0.4698
house_age                 44.3178    41.0       29.3715     0          116        0.4691
Plot 3: Histograms of Numerical Features:
price_log1p: Slightly skewed right with a mean of 13.05 and a median of 13.02. The standard deviation is 0.53, ranging from 11.23 to 15.86.
living_measure_log1p: Nearly symmetric with a mean of 7.55 and a median of 7.56. The standard deviation is 0.42, ranging from 5.67 to 9.51.
living_measure15_log1p: Slightly skewed right with a mean of 7.54 and a median of 7.52. The standard deviation is 0.33, ranging from 5.99 to 8.73.
lot_measure_log1p: Skewed right with a mean of 8.99 and a median of 8.94. The standard deviation is 0.90, ranging from 6.26 to 14.32.
lot_measure15_log1p: Skewed right with a mean of 8.96 and a median of 8.94. The standard deviation is 0.81, ranging from 6.48 to 13.68.
total_area_log1p: Highly skewed right with a mean of 9.25 and a median of 9.17. The standard deviation is 0.77, ranging from 7.26 to 14.32.
ceil_measure_log1p: Slightly skewed right with a mean of 7.40 and a median of 7.35. The standard deviation is 0.43, ranging from 5.67 to 9.15.
basement_log1p: Slightly skewed right with a mean of 2.53 and a median of 0. The standard deviation is 3.17, ranging from 0 to 8.48.
yr_renovated: Highly skewed right with a mean of 84.40 and a median of 0. The standard deviation is 401.68, ranging from 0 to 2015.
yr_built: Slightly skewed left with a mean of 1971.01 and a median of 1975. The standard deviation is 29.37, ranging from 1900 to 2015.
house_age: Slightly skewed right with a mean of 44.32 and a median of 41. The standard deviation is 29.37, ranging from 0 to 116.
# Compute correlation matrix
# get numerical columns
numerical_df = df.select_dtypes(include=np.number)
# compute the correlation matrix using only numerical data
corr_matrix = numerical_df.corr()
# Display heatmap of correlations with price
plt.figure(figsize=(12, 6))
sns.heatmap(corr_matrix[['price_log1p']].sort_values(by='price_log1p', ascending=False), annot=True, cmap='coolwarm', linewidths=0.5)
plt.title("Feature Correlation with House Price_log1p")
plt.show()
Plot 4: Feature Correlation with Price
Strongest positive correlations with price:
Strongest negative correlations with price:
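The two lists above can be filled programmatically by ranking the correlation column. A hedged sketch on a small synthetic frame (the notebook itself would use its `df` / `corr_matrix` instead):

```python
# Rank numeric features by their correlation with price_log1p.
# Synthetic stand-in data: living area drives price up, age drifts down with price.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
demo = pd.DataFrame({'living_measure_log1p': rng.normal(7.5, 0.4, 500)})
demo['price_log1p'] = 0.8 * demo['living_measure_log1p'] + rng.normal(0, 0.2, 500)
demo['house_age'] = -0.5 * demo['price_log1p'] + rng.normal(44, 0.5, 500)

corr = (demo.corr()['price_log1p']
            .drop('price_log1p')          # exclude the target's self-correlation of 1.0
            .sort_values(ascending=False))
print("Strongest positive correlations with price:\n", corr.head(3))
print("Strongest negative correlations with price:\n", corr.tail(3))
```

The head of the sorted series lists the strongest positive drivers and the tail the strongest negative ones.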
# Correlation Matrix Heatmap
# get numerical columns
numerical_df = df.select_dtypes(include=np.number)
# compute the correlation matrix using only numerical data
corr_matrix = numerical_df.corr()
plt.figure(figsize=(14, 12))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".1f")
plt.title('Correlation Matrix of Numerical Features')
plt.show()
# 'total_area_log1p' and 'lot_measure_log1p' are almost perfectly positively correlated (1.0 at the heatmap's precision), so drop 'lot_measure_log1p' from df
df = df.drop('lot_measure_log1p', axis=1)
# 'house_age' and 'yr_built' are almost perfectly negatively correlated (house_age is derived from yr_built), so drop 'yr_built' from df
df = df.drop('yr_built', axis=1)
# Correlation Matrix Heatmap
# get numerical columns
numerical_df = df.select_dtypes(include=np.number)
# compute the correlation matrix using only numerical data
corr_matrix = numerical_df.corr()
plt.figure(figsize=(14, 12))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".1f")
plt.title('Correlation Matrix of Numerical Features')
plt.show()
Plot 5: Correlation Matrix Heatmap
living_measure_log1p and ceil_measure_log1p: A very strong positive correlation (0.9) shows that properties with larger living areas also tend to have larger above-ground areas. This is expected, since ceil_measure (square footage excluding the basement) is a component of the living area.
living_measure_log1p and living_measure15_log1p: A high positive correlation (0.7) means that houses with larger living areas overall also had large living areas as measured in 2015 (after any renovations). This reinforces the importance of living space size in general.
living_measure15_log1p and ceil_measure_log1p: A strong positive correlation (0.7) indicates that properties with larger 2015 living areas generally have larger above-ground areas as well.
living_measure_log1p and price_log1p: A strong positive correlation (0.7) suggests that larger living spaces tend to command higher prices.
living_measure15_log1p and price_log1p: A notable positive correlation (0.6) indicates that larger 2015 living areas are associated with higher prices. Combined with the 'living_measure' correlation, this highlights the importance of living space size and its recent state.
ceil_measure_log1p and price_log1p: A notable positive correlation (0.6) implies that larger above-ground areas are linked to higher property values.
# Box plots to visualize the distribution of numerical features for different categories
num_features = ['price_log1p', 'living_measure_log1p', 'ceil_measure_log1p', 'basement_log1p', 'total_area_log1p', 'house_age']
for col in num_features:
    plt.figure(figsize=(10, 6))
    sns.boxplot(x='condition', y=col, data=df, palette='viridis')
    plt.title(f'Box plot of {col} by Condition')
    plt.show()
# Get the numeric columns
numerical_cols = ['price_log1p', 'living_measure_log1p', 'ceil_measure_log1p', 'basement_log1p', 'total_area_log1p', 'house_age']
for col in numerical_cols:
    medians_condition = df.groupby('condition')[col].median()
    print(f"Median {col} by Condition:\n{medians_condition}\n")
Median price_log1p by Condition:
condition
2    12.531750
3    13.019225
4    12.994532
5    13.172000
Name: price_log1p, dtype: float64

Median living_measure_log1p by Condition:
condition
2    7.143610
3    7.586296
4    7.507141
5    7.539559
Name: living_measure_log1p, dtype: float64

Median ceil_measure_log1p by Condition:
condition
2    7.039660
3    7.427144
4    7.266129
5    7.259116
Name: ceil_measure_log1p, dtype: float64

Median basement_log1p by Condition:
condition
2    0.000000
3    0.000000
4    0.000000
5    6.122433
Name: basement_log1p, dtype: float64

Median total_area_log1p by Condition:
condition
2    9.201401
3    9.133675
4    9.252058
5    9.134107
Name: total_area_log1p, dtype: float64

Median house_age by Condition:
condition
2    66.0
3    27.0
4    53.0
5    65.0
Name: house_age, dtype: float64
Plot 6: Distributions of the numerical features ['price_log1p', 'living_measure_log1p', 'ceil_measure_log1p', 'basement_log1p', 'total_area_log1p', 'house_age'] across the different condition categories:
price_log1p: increases with better condition overall, from a median of 12.53 for condition 2 to 13.17 for condition 5 (with a slight dip to 12.99 at condition 4). Properties in better condition tend to command higher prices.
living_measure_log1p: condition-2 houses are markedly smaller (median 7.14) than those in conditions 3-5 (7.51-7.59), suggesting that better-condition properties tend to have larger living areas.
ceil_measure_log1p: follows the same pattern, with condition 2 lowest (7.04) and conditions 3-5 between 7.26 and 7.43.
basement_log1p: the median is 0 for conditions 2 to 4 but 6.12 for condition 5, indicating that basements are far more common among houses in the best condition.
total_area_log1p: is relatively consistent across conditions, ranging from 9.13 for condition 3 to 9.25 for condition 4, so total area does not vary much with condition.
house_age: varies non-monotonically: condition-3 houses are the youngest (median 27 years), while conditions 2 and 5 are much older (66 and 65 years). Older houses appear at both the poorest and the best condition levels, perhaps reflecting renovation of some.
# Box plots to visualize the distribution of numerical features for different categories
for col in numerical_cols:
    plt.figure(figsize=(10, 6))
    sns.boxplot(x='quality', y=col, data=df, palette='viridis')
    plt.title(f'Box plot of {col} by Quality')
    plt.xlabel('Quality')
    plt.ylabel(col)
    plt.show()
for col in numerical_cols:
    medians_quality = df.groupby('quality')[col].median()
    print(f"Median {col} by Quality:\n{medians_quality}\n")
Median price_log1p by Quality:
quality
6    12.932415
7    12.834684
8    13.142168
9    13.487008
Name: price_log1p, dtype: float64

Median living_measure_log1p by Quality:
quality
6    7.346655
7    7.396949
8    7.673688
9    7.944847
Name: living_measure_log1p, dtype: float64

Median ceil_measure_log1p by Quality:
quality
6    7.223296
7    7.186144
8    7.484930
9    7.852050
Name: ceil_measure_log1p, dtype: float64

Median basement_log1p by Quality:
quality
6    0.0
7    0.0
8    0.0
9    0.0
Name: basement_log1p, dtype: float64

Median total_area_log1p by Quality:
quality
6    9.274723
7    9.115040
8    9.161465
9    9.323669
Name: total_area_log1p, dtype: float64

Median house_age by Quality:
quality
6    59.0
7    52.0
8    31.0
9    19.0
Name: house_age, dtype: float64
Plot 7: Distributions of the numerical features ['price_log1p', 'living_measure_log1p', 'ceil_measure_log1p', 'basement_log1p', 'total_area_log1p', 'house_age'] across the different quality categories:
price_log1p: increases markedly with quality at the upper end, from a median of 12.83 at quality 7 to 13.14 at quality 8 and 13.49 at quality 9. Higher-quality properties tend to command substantially higher prices.
living_measure_log1p: rises steadily with quality, from a median of 7.35 at quality 6 to 7.94 at quality 9, suggesting that higher-quality properties tend to have much larger living areas.
ceil_measure_log1p: shows the same pattern overall, increasing from 7.22 at quality 6 to 7.85 at quality 9 (with a slight dip at quality 7).
basement_log1p: the median is 0 at every quality level shown, so most houses in each group have no basement.
total_area_log1p: is relatively stable across quality levels (9.12 to 9.32), so total area varies little with quality.
house_age: decreases sharply with quality, from a median of 59 years at quality 6 to 19 years at quality 9, indicating that newer houses are generally built to a higher grade.
# Bivariate Relationships
# ['price_log1p', 'room_bed', 'room_bath']
# Analyze the relationship between 'room_bed' and 'price_log1p'
plt.figure(figsize=(8, 6))
sns.boxplot(x='room_bed', y='price_log1p', data=df, palette='viridis')
plt.title('Box Plot of Price_log1p by Number of Bedrooms')
plt.xlabel('Number of Bedrooms')
plt.ylabel('Price_log1p')
plt.show()
# Analyze the relationship between 'room_bath' and 'price_log1p'
plt.figure(figsize=(8, 6))
sns.boxplot(x='room_bath', y='price_log1p', data=df, palette='viridis')
plt.title('Box Plot of Price_log1p by Number of Bathrooms')
plt.xlabel('Number of Bathrooms')
plt.ylabel('Price_log1p')
plt.show()
Plot 8: Bivariate Relationship Analysis
Room_Bed vs. price_log1p: Positive correlation; more bedrooms generally lead to higher prices, but outliers (e.g., large houses with fewer bedrooms) can skew the data.
Room_Bath vs. price_log1p: Positive correlation; more bathrooms typically result in higher prices, though the relationship isn't perfectly linear.
# Segment total area
df['total_area_log1p_segment'] = pd.cut(df['total_area_log1p'], bins=4, labels=['Small', 'Medium', 'Large', 'Very Large'])
# Plot the relationship between 'price_log1p' and 'total area' with segments
plt.figure(figsize=(10, 6))
sns.scatterplot(x='total_area_log1p', y='price_log1p', hue='total_area_log1p_segment', data=df, palette='viridis')
plt.title('Price_log1p vs. Total Area_log1p with Segmentation')
plt.xlabel('Total Area_log1p')
plt.ylabel('Price_log1p')
plt.show()
# Insights:
# Calculate median prices for each segment
median_prices_log1p_by_segment = df.groupby('total_area_log1p_segment')['price_log1p'].median()
print(median_prices_log1p_by_segment)
# Example insight generation (replace with your own analysis):
# Observe if there are noticeable differences in median price among the segments.
# For instance, if the median price for "Large" properties is significantly higher than for "Medium" properties,
# it corroborates the positive correlation.
# Print insights based on the median price for each segment.
for segment, median_price_log1p in median_prices_log1p_by_segment.items():
    # No '$' prefix: these are log1p-transformed values, not dollar amounts
    print(f"The median_price_log1p for {segment} properties is {median_price_log1p:,.6f}")
# Analyze if there are any outliers within the segments and their implications.
# You can consider using boxplots to visualize this within each segment.
# Visualize the distribution of prices within each segment using box plots.
plt.figure(figsize=(10, 6))
sns.boxplot(x='total_area_log1p_segment', y='price_log1p', data=df, palette='viridis')
plt.title('Price_log1p Distribution by Total Area_log1p Segment')
plt.xlabel('Total Area_log1p Segment')
plt.ylabel('Price_log1p')
plt.show()
total_area_log1p_segment
Small         13.000198
Medium        13.017004
Large         13.252032
Very Large    13.391566
Name: price_log1p, dtype: float64
The median_price_log1p for Small properties is 13.000198
The median_price_log1p for Medium properties is 13.017004
The median_price_log1p for Large properties is 13.252032
The median_price_log1p for Very Large properties is 13.391566
Plot 9: Insights drawn on total area segments and their median prices:
Reduction of Skewness: The log1p transformation helps to reduce the skewness of the data, making the distribution more normal. This can lead to more accurate and reliable statistical analyses and insights.
Relative Differences: The transformation compresses the range of values, which can make relative differences between segments more apparent. For example, the difference between median prices of different segments might appear smaller after transformation, but the relative ranking and trends remain the same.
Interpretation of Values: The transformed values are on a logarithmic scale, which means that differences in the transformed values correspond to multiplicative differences in the original scale. For instance, a difference of 1 in log1p values corresponds to a difference of approximately 2.718 times in the original values.
Handling Outliers: The log1p transformation reduces the impact of outliers, making the analysis more robust. This can lead to more stable insights, especially in datasets with extreme values.
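The multiplicative-interpretation point above can be checked numerically: for prices this large, a unit difference on the log1p scale back-transforms to almost exactly an e-fold price ratio.

```python
# A difference of 1 on the log1p scale corresponds to roughly a 2.718x price ratio.
import numpy as np

a, b = 13.0, 14.0                       # two log1p prices one unit apart
price_a, price_b = np.expm1(a), np.expm1(b)
print(round(price_b / price_a, 3))      # ~2.718 (i.e. e) for prices this large
```

The same back-transform (`np.expm1`) can be applied to any of the segment medians above to read them in dollars.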
Insights given we take log1p Transformation on both Price and Total Area:
Prices Rise with Total Area: Median price_log1p increases monotonically across the segments, from 13.00 for Small and 13.02 for Medium to 13.25 for Large properties.
Very Large Properties Command the Highest Prices: At 13.39, the median price_log1p for Very Large properties is the highest of all segments, confirming the positive price-area relationship.
Overall, the log1p transformation highlights the relative differences between segments while reducing the impact of extreme values, leading to more robust and interpretable results.
# Segment house age into categories (adjust bins as needed)
df['house_age_segment'] = pd.cut(df['house_age'], bins=4, labels=['New', 'Young', 'Mature', 'Old'])
# Plot the relationship between 'price' and 'house age' with segments
plt.figure(figsize=(10, 6))
sns.scatterplot(x='house_age', y='price_log1p', hue='house_age_segment', data=df, palette='viridis')
plt.title('Price vs. House Age with Segmentation')
plt.xlabel('House Age')
plt.ylabel('Price_log1p')
plt.show()
# Insights:
# Calculate median prices for each segment
median_prices_by_segment = df.groupby('house_age_segment')['price_log1p'].median()
print(median_prices_by_segment)
# Example insight generation (replace with your own analysis):
# Observe if there are noticeable differences in median price among the segments.
# For instance, if the median price for "New" properties is significantly higher than for "Old" properties,
# it corroborates the positive correlation.
# Print insights based on the median price for each segment.
for segment, median_price in median_prices_by_segment.items():
    # No '$' prefix: these are log1p-transformed values, not dollar amounts
    print(f"The median price for {segment} properties is {median_price:,.6f}")
# Analyze if there are any outliers within the segments and their implications.
# You can consider using boxplots to visualize this within each segment.
# Visualize the distribution of prices within each segment using box plots.
plt.figure(figsize=(10, 6))
sns.boxplot(x='house_age_segment', y='price_log1p', data=df, palette='viridis')
plt.title('Price_log1p Distribution by House Age Segment')
plt.xlabel('House Age Segment')
plt.ylabel('Price_log1p')
plt.show()
house_age_segment
New       13.120363
Young     12.923915
Mature    12.923915
Old       13.190024
Name: price_log1p, dtype: float64
The median price for New properties is 13.120363
The median price for Young properties is 12.923915
The median price for Mature properties is 12.923915
The median price for Old properties is 13.190024
Plot 10: Insights drawn from the provided data on house age segments and their median prices:
Old Houses Command the Highest Prices: The median price_log1p for Old houses is the highest at 13.19, slightly above New houses at 13.12. Some older properties may have unique characteristics, desirable locations, or historical value that make them more expensive.
New Houses Are Also Highly Valued: Newly built properties (median 13.12) clearly out-price the middle age segments, indicating that recent construction is highly valued.
Young and Mature Houses Are the Cheapest: Both middle segments share the lowest median price_log1p of 12.92, suggesting a U-shaped relationship between age and price rather than a steady decline.
Potential for Further Analysis: Factors such as renovations, location, and historical significance might explain the premium on older properties, warranting further investigation.
# Analyze the relationship between categorical features and the target variable
for col in ['ceil', 'coast', 'furnished']:
    plt.figure(figsize=(10, 6))
    sns.boxplot(x=col, y='price_log1p', data=df, palette='viridis')
    plt.title(f"Price_log1p distribution by {col}")
    plt.show()
Plot 11: Relationship between each categorical feature and the target variable 'price':
Floors (ceil): Median and mean prices vary with the number of floors, with multi-level houses generally associated with higher prices. Further investigation is needed to determine the significance.
Coast: Houses near the coast (coast = 1) have significantly higher median and mean prices compared to non-coastal houses (coast = 0), indicating a substantial premium for coastal properties.
Furnished: Furnished houses have higher median and mean prices compared to unfurnished ones, suggesting that furnishings contribute to higher property values.
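The coastal premium claimed above can be quantified by back-transforming the median gap in price_log1p into a dollar-scale multiplier. A sketch on synthetic stand-in data (the notebook would group its own `df` by `coast`):

```python
# Express the coast-vs-no-coast median price gap as a multiplicative premium.
# Synthetic data: coastal houses carry a +0.5 shift on the log1p scale.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
demo = pd.DataFrame({'coast': rng.integers(0, 2, 1000)})
demo['price_log1p'] = 13.0 + 0.5 * demo['coast'] + rng.normal(0, 0.3, 1000)

medians = demo.groupby('coast')['price_log1p'].median()
premium = np.expm1(medians.loc[1]) / np.expm1(medians.loc[0])
print(round(premium, 2))  # multiplicative premium for coast == 1 (~e^0.5 ≈ 1.65 here)
```

A ratio like 1.65 would read as "coastal houses sell for roughly 65% more at the median", which is easier to communicate than a gap in log units.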
# Calculate the mean price for each number of bedrooms
mean_price_by_bedrooms = df.groupby('room_bed')['price_log1p'].mean()
# Create the plot
plt.figure(figsize=(10, 6))
sns.barplot(x=mean_price_by_bedrooms.index, y=mean_price_by_bedrooms.values, palette='viridis')
plt.xlabel('Number of Bedrooms')
plt.ylabel('Mean Price_log1p')
plt.title('Mean Price_log1p by Number of Bedrooms')
plt.show()
Plot 12: Interpretation of Mean Price by Number of Bedrooms:
Mean price_log1p generally increases with the number of bedrooms, though the pattern becomes irregular above 7 bedrooms, where few observations exist.
# Plot mean price_log1p by sight view rating
# Calculate the mean price for each sight view rating
mean_price_by_sight = df.groupby('sight')['price_log1p'].mean()
# Create the plot
plt.figure(figsize=(10, 6))
sns.barplot(x=mean_price_by_sight.index, y=mean_price_by_sight.values, palette='viridis')
plt.xlabel('Sight View Rating')
plt.ylabel('Mean Price_log1p')
plt.title('Mean Price_log1p by Sight View Rating')
plt.show()
Plot 13: Interpretation of Mean Price_log1p by Sight View Rating:
The house price prediction analysis identified key factors influencing property values, including property size, condition, quality, location, and house age. Larger living areas, coastal proximity, and high-quality homes significantly contribute to higher prices. Sales trends fluctuate, and external factors may influence demand. By leveraging predictive modeling, we can enhance price estimation accuracy and market insights.
Key Takeaways
Business & Strategic Recommendations
For CMO (Marketing & Customer Acquisition):
For COO (Operational & Risk Management):
For Data Science Manager (Predictive Modeling & AI Implementation):
# Drop the 'cid' column
# 'cid' is a unique identifier with no predictive value, it should be dropped to prevent data leakage or model bias
df = df.drop('cid', axis=1)
# 'zipcode' is a categorical variable; one-hot encoding it would create a huge number of columns, so target encoding is used instead
# The `target_encode` function performs target encoding on a categorical feature (like 'zipcode') to create a new numerical feature ('zip_code_target_encoded').
# Target encoding replaces the categorical values with a numerical representation that reflects the average value of the target variable ('price_log1p' in this case) for each category.
# This helps incorporate the predictive power of the categorical feature into a model.
# Here's a breakdown:
# 1. **Calculate the overall mean of the target variable:** `mean_target = df[target].mean()` This finds the average 'price_log1p' across the entire dataset.
# 2. **Group by the categorical feature and aggregate:** `agg = df.groupby(col)[target].agg(['count', 'mean'])` This groups the data by 'zipcode', then calculates the number of occurrences ('count') and the average 'price_log1p' ('mean') for each unique zip code.
# 3. **Smoothing:** `smooth = (counts * means + smoothing * mean_target) / (counts + smoothing)` This is the core of target encoding. It's a weighted average that combines the mean 'price_log1p' for each zip code with the overall mean 'price_log1p'. The `smoothing` parameter controls the influence of the overall mean. A higher `smoothing` value reduces the impact of zip codes with fewer observations, preventing overfitting to small sample sizes. In essence, it prevents extreme values from dominating the encoding for rare zipcodes.
# 4. **Mapping:** `return df[col].map(smooth)` This maps the original 'zipcode' values to the calculated smoothed means. This creates the new 'zip_code_target_encoded' column, where each zip code is now represented by a number that reflects its average 'price_log1p', adjusted for sample size using the smoothing parameter.
# **Example:**
# Let's say zip code 90210 has 10 houses with an average price of $1,000,000. Another zip code, 99999, has only 2 houses with an average price of $2,000,000.
# The overall average house price is $750,000. With a smoothing value of 10, the target encoding would be:
#
# **90210:** `(10 * $1,000,000 + 10 * $750,000) / (10 + 10) = $875,000`
# **99999:** `(2 * $2,000,000 + 10 * $750,000) / (2 + 10) = $958,333`
#
# Notice how, because 99999 has only two observations, its own mean carries less weight and its encoded value is pulled strongly toward the overall average.
# After this, the original `zipcode` column is dropped, because `zip_code_target_encoded` now captures its relevant information in a way that is usable for machine learning models.
def target_encode(df, col, target, smoothing=1):
    mean_target = df[target].mean()
    agg = df.groupby(col)[target].agg(['count', 'mean'])
    counts = agg['count']
    means = agg['mean']
    smooth = (counts * means + smoothing * mean_target) / (counts + smoothing)
    return df[col].map(smooth)
df['zip_code_target_encoded'] = target_encode(df, 'zipcode', 'price_log1p', smoothing=10)
df = df.drop('zipcode', axis=1)
# Change type of feature 'zip_code_target_encoded' into float numerical
df['zip_code_target_encoded'] = df['zip_code_target_encoded'].astype(float)
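As a sanity check, the smoothing arithmetic described in the comments above can be reproduced on a tiny toy frame (illustrative zip codes and prices, not rows from this dataset). Note that `mean_target` here is the mean of the toy frame itself, so the encoded values differ from the worked example, which assumed a $750,000 overall average:

```python
import pandas as pd

def target_encode(df, col, target, smoothing=1):
    # Same smoothed-mean target encoding as used for 'zipcode' above
    mean_target = df[target].mean()
    agg = df.groupby(col)[target].agg(['count', 'mean'])
    smooth = (agg['count'] * agg['mean'] + smoothing * mean_target) / (agg['count'] + smoothing)
    return df[col].map(smooth)

# Toy data: 10 houses at $1,000,000 in 90210, 2 houses at $2,000,000 in 99999
toy = pd.DataFrame({
    'zipcode': ['90210'] * 10 + ['99999'] * 2,
    'price': [1_000_000.0] * 10 + [2_000_000.0] * 2,
})
encoded = target_encode(toy, 'zipcode', 'price', smoothing=10)

overall = toy['price'].mean()  # the overall mean of this toy frame
expected_90210 = (10 * 1_000_000 + 10 * overall) / (10 + 10)
expected_99999 = (2 * 2_000_000 + 10 * overall) / (2 + 10)
print(encoded.iloc[0], encoded.iloc[-1])
```

The rare zip code's raw mean of $2,000,000 is pulled much closer to the overall average than the well-populated zip code's mean, which is exactly the shrinkage behaviour the smoothing parameter is meant to produce.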
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 26 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   year_month                21613 non-null  object
 1   room_bed                  21613 non-null  category
 2   room_bath                 21613 non-null  category
 3   ceil                      21613 non-null  category
 4   coast                     21613 non-null  category
 5   sight                     21613 non-null  category
 6   condition                 21613 non-null  category
 7   quality                   21613 non-null  category
 8   yr_renovated              21613 non-null  int64
 9   lat                       21613 non-null  int64
 10  long                      21613 non-null  int64
 11  furnished                 21613 non-null  category
 12  sold_year                 21613 non-null  int64
 13  house_age                 21613 non-null  int64
 14  price_log1p               21613 non-null  float64
 15  living_measure_log1p      21613 non-null  float64
 16  living_measure15_log1p    21613 non-null  float64
 17  lot_measure15_log1p       21613 non-null  float64
 18  total_area_log1p          21613 non-null  float64
 19  ceil_measure_log1p        21613 non-null  float64
 20  basement_log1p            21613 non-null  float64
 21  avg_price_log1p_zipcode   21613 non-null  float64
 22  cluster                   21613 non-null  int32
 23  total_area_log1p_segment  21613 non-null  category
 24  house_age_segment         21613 non-null  category
 25  zip_code_target_encoded   21613 non-null  float64
dtypes: category(10), float64(9), int32(1), int64(5), object(1)
memory usage: 2.8+ MB
# Extract numerical_cols and categorical_cols from df and print them
numerical_cols = df.select_dtypes(include=np.number).columns.tolist()
categorical_cols = df.select_dtypes(exclude=np.number).columns.tolist()
# Separate numerical and categorical features
df_numerical = df[numerical_cols]
df_categorical = df[categorical_cols]
# Standardize numerical features
scaler = StandardScaler()
df_numerical_scaled = pd.DataFrame(scaler.fit_transform(df_numerical), columns=numerical_cols)
# One-hot encode categorical features
df_categorical_encoded = pd.get_dummies(df_categorical, columns=categorical_cols, dtype=int, drop_first=True)
# Concatenate the scaled numerical and encoded categorical features
df_processed = pd.concat([df_numerical_scaled, df_categorical_encoded], axis=1)
df_processed.head()
| yr_renovated | lat | long | sold_year | house_age | price_log1p | living_measure_log1p | living_measure15_log1p | lot_measure15_log1p | total_area_log1p | ... | quality_7 | quality_8 | quality_9 | furnished_1 | total_area_log1p_segment_Medium | total_area_log1p_segment_Large | total_area_log1p_segment_Very Large | house_age_segment_Young | house_age_segment_Mature | house_age_segment_Old | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.210128 | 0.663544 | 0.00115 | 1.447901 | 0.193465 | 0.487460 | 1.112839 | 0.217640 | 0.129692 | 0.231546 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 1 | -0.210128 | 0.663544 | 0.00115 | 1.447901 | 0.806318 | -1.695310 | -2.455518 | -0.382255 | -0.790171 | -1.316641 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | -0.210128 | 0.663544 | 0.00115 | -0.690655 | 0.159418 | 0.872686 | 1.105104 | 1.012608 | -1.432076 | -0.839398 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | -0.210128 | -1.507059 | 0.00115 | -0.690655 | -1.304621 | -1.121951 | -0.209011 | 0.232734 | -0.885586 | -0.837977 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | -0.210128 | 0.663544 | 0.00115 | 1.447901 | 1.623456 | -0.058624 | -1.246300 | -1.584655 | -0.521686 | -0.780338 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
5 rows × 53 columns
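One caveat with the preprocessing above: `StandardScaler` is fit on the full dataset before the train/test split, so test-set statistics leak into the scaling. A leakage-free alternative is to put the scaler and encoder inside an sklearn `Pipeline`, which fits them on the training fold only. A minimal sketch on toy data (the column names here are illustrative, not the real features):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the housing frame (synthetic, not the real dataset)
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    'living_log1p': rng.normal(7.5, 0.4, 300),
    'house_age': rng.integers(0, 100, 300).astype(float),
    'condition': rng.integers(1, 6, 300).astype(str),
})
toy['price_log1p'] = (0.8 * toy['living_log1p']
                      - 0.002 * toy['house_age']
                      + rng.normal(0, 0.05, 300))

pre = ColumnTransformer([
    ('num', StandardScaler(), ['living_log1p', 'house_age']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['condition']),
])
pipe = Pipeline([('pre', pre), ('model', LinearRegression())])

X_train, X_test, y_train, y_test = train_test_split(
    toy.drop(columns='price_log1p'), toy['price_log1p'],
    test_size=0.3, random_state=42)
pipe.fit(X_train, y_train)          # scaler/encoder fit on the training fold only
score = pipe.score(X_test, y_test)  # R² on the held-out fold
```

In practice the leakage from scaling alone is usually small, but the pipeline pattern also prevents subtler leaks (e.g. target encoding computed on the full data) and makes cross-validation honest.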
X = df_processed.drop(columns=['price_log1p']) # Removing target variable 'price_log1p'
y = df_processed['price_log1p']
# Splitting data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Train models
models = {
'Linear Regression': LinearRegression(),
'Ridge Regression': Ridge(),
'Lasso Regression': Lasso(),
'Decision Tree Regression': DecisionTreeRegressor(),
'Random Forest Regressor': RandomForestRegressor(),
'Ada Boost Regressor': AdaBoostRegressor(),
'Gradient Boost Regressor': GradientBoostingRegressor(),
'XG Boost Regressor': XGBRegressor()
}
results = {}
results_fe = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Evaluate the model on the test set
    mse = mean_squared_error(y_test, y_pred)  # Mean Squared Error
    rmse = np.sqrt(mse)  # Root Mean Squared Error
    r2 = r2_score(y_test, y_pred)  # R-squared score
    results[name] = {"RMSE": rmse, "R² Score": r2}
# Convert results to DataFrame and display
results_df = pd.DataFrame(results).T
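The metric calls in the loop are easy to mislabel, so here is a quick check that sklearn's `mean_squared_error` and `r2_score` match the textbook formulas (toy numbers only):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([3.0, 2.5, 4.0, 5.5])
y_pred = np.array([2.8, 2.7, 4.4, 5.0])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))      # sklearn MSE, then sqrt
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))  # RMSE from the definition

ss_res = np.sum((y_true - y_pred) ** 2)                 # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)          # total sum of squares
r2_manual = 1 - ss_res / ss_tot                         # R² from the definition
```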
# Define a dummy function for tools.display_dataframe_to_user
def display_dataframe_to_user(name, dataframe):
    """Displays the dataframe in a user-friendly format."""
    print(f"--- {name} ---")
    display(dataframe)
# Create a DataFrame for original model performance
original_model_performance = pd.DataFrame(results).T.reset_index()
original_model_performance.rename(columns={'index': 'Model'}, inplace=True)
# Add train performance metrics
train_results = {}
for name, model in models.items():
    y_train_pred = model.predict(X_train)
    train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
    train_r2 = r2_score(y_train, y_train_pred)
    train_results[name] = {"Train RMSE": train_rmse, "Train R² Score": train_r2}
train_results_df = pd.DataFrame(train_results).T.reset_index()
train_results_df.rename(columns={'index': 'Model'}, inplace=True)
# Merge with the original model performance DataFrame
original_model_performance = pd.merge(original_model_performance, train_results_df, on='Model')
# Rename the original model columns
original_model_performance.rename(columns={'RMSE': 'Test RMSE', 'R² Score': 'Test R² Score'}, inplace=True)
display_dataframe_to_user("Original Model Performance Comparison (Train vs Test)", original_model_performance)
--- Original Model Performance Comparison (Train vs Test) ---
| Model | Test RMSE | Test R² Score | Train RMSE | Train R² Score | |
|---|---|---|---|---|---|
| 0 | Linear Regression | 0.360576 | 0.868788 | 0.364812 | 0.867422 |
| 1 | Ridge Regression | 0.360572 | 0.868791 | 0.364816 | 0.867420 |
| 2 | Lasso Regression | 0.995529 | -0.000201 | 1.001923 | 0.000000 |
| 3 | Decision Tree Regression | 0.501102 | 0.746585 | 0.000345 | 1.000000 |
| 4 | Random Forest Regressor | 0.355431 | 0.872506 | 0.133604 | 0.982218 |
| 5 | Ada Boost Regressor | 0.472648 | 0.774547 | 0.470597 | 0.779387 |
| 6 | Gradient Boost Regressor | 0.354289 | 0.873324 | 0.345049 | 0.881398 |
| 7 | XG Boost Regressor | 0.335751 | 0.886233 | 0.214933 | 0.953981 |
Baseline Model Performance Comparison (Train vs Test)
The table provides performance metrics for various regression models. The three best-performing models based on the lowest Test RMSE and highest Test R² Score are:
XG Boost Regressor:
Gradient Boost Regressor:
Random Forest Regressor:
XG Boost Regressor is the best-performing model with the lowest test RMSE and highest test R² score, indicating excellent predictive accuracy and generalization.
Gradient Boost Regressor and Random Forest Regressor also perform very well, with low test RMSE and high test R² scores, making them strong contenders.
These three models show the best balance between low error (RMSE) and high explanatory power (R² Score) on the test data, indicating strong predictive performance.
Linear and Ridge Regression models perform similarly and quite well, with low RMSE and high R² scores, indicating good predictive accuracy and generalization.
Ada Boost Regressor has a higher RMSE and lower R² score compared to Linear and Ridge Regression, indicating it may not be the best choice for this dataset.
The Decision Tree Regression model shows signs of overfitting. It performs exceptionally well on the training data but not as well on the test data. This discrepancy suggests that the model may not generalize well to unseen data.
The Lasso Regression model is not performing well on this dataset. Both the training and test errors are high, and the R² scores are very low, indicating that the model is not fitting the data properly. This poor performance could be due to the nature of the dataset or the regularization parameter in Lasso Regression, which might be too strong, leading to underfitting.
def plot_feature_importances(model, model_name):
    try:
        feature_importances = model.feature_importances_
    except AttributeError:
        # Handle models without feature_importances_ (e.g., linear models):
        # use the absolute value of the coefficients instead
        try:
            feature_importances = np.abs(model.coef_)
        except AttributeError:
            print(f"Warning: {model_name} has neither feature_importances_ nor coef_. Skipping feature importance plot.")
            return []  # Return empty list to indicate no top features
    feature_importance_df = pd.DataFrame({'Feature': X_train.columns, 'Importance': feature_importances})
    feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)
    # Get top 3 features for this model
    top_features = feature_importance_df['Feature'].head(3).tolist()
    # Assign top features to a global variable named after the model
    globals()[model_name.replace(" ", "_") + "_top_features"] = top_features
    plt.figure(figsize=(10, 6))
    sns.barplot(x='Importance', y='Feature', data=feature_importance_df.head(15))
    plt.title(f'Top 15 Feature Importances for {model_name}')
    plt.xlabel('Importance')
    plt.ylabel('Feature')
    plt.show()
    return top_features
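A caveat worth keeping in mind: impurity-based `feature_importances_` can overstate high-cardinality or continuous features, and `coef_` magnitudes are only comparable here because the inputs were standardized. A model-agnostic cross-check is permutation importance on held-out data; a small sketch on synthetic data (not the housing features):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic regression task: only the first 2 of 5 features carry signal
X, y = make_regression(n_samples=400, n_features=5, n_informative=2,
                       shuffle=False, noise=1.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)
# Shuffle each feature on the test fold and measure the drop in score
perm = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
top2 = set(np.argsort(perm.importances_mean)[::-1][:2])
```

Because permutation importance is measured on held-out data, it also guards against importances that merely reflect overfitting to the training set.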
xgb_model = models['XG Boost Regressor']
gb_model = models['Gradient Boost Regressor']
rf_model = models['Random Forest Regressor']
# Fit the XG Boost model if it hasn't been fitted yet
if not hasattr(xgb_model, 'feature_importances_'):
    xgb_model.fit(X_train, y_train)
# Fit the Gradient Boosting model if it hasn't been fitted yet
if not hasattr(gb_model, 'feature_importances_'):
    gb_model.fit(X_train, y_train)
# Fit the Random Forest model if it hasn't been fitted yet
if not hasattr(rf_model, 'feature_importances_'):
    rf_model.fit(X_train, y_train)
xgb_top_features = plot_feature_importances(xgb_model, 'XG Boost Regressor')
gb_top_features = plot_feature_importances(gb_model, 'Gradient Boost Regressor')
rf_top_features = plot_feature_importances(rf_model, 'Random Forest Regressor')
print("XG Boost Regressor Top 3 Features:")
print(xgb_top_features)
print("\nGradient Boost Regressor Top 3 Features:")
print(gb_top_features)
print("\nRandom Forest Regressor Top 3 Features:")
print(rf_top_features)
XG Boost Regressor Top 3 Features:
['furnished_1', 'avg_price_log1p_zipcode', 'living_measure_log1p']

Gradient Boost Regressor Top 3 Features:
['avg_price_log1p_zipcode', 'living_measure_log1p', 'zip_code_target_encoded']

Random Forest Regressor Top 3 Features:
['avg_price_log1p_zipcode', 'living_measure_log1p', 'zip_code_target_encoded']
Interpretation of Top Features for the top 3 Best Models:
XG Boost Regressor:
furnished_1: This feature has the highest importance (0.547867), indicating that whether a property is furnished is the most influential factor in the model's predictions.
avg_price_log1p_zipcode: The average price in the zip code (0.104854) is the second most important feature.
living_measure_log1p: The logarithm of the living area size (0.060682) is the third most important feature.
Gradient Boost Regressor:
living_measure_log1p: The logarithm of the living area size (0.288158) is the most important feature.
avg_price_log1p_zipcode: The average price in the zip code (0.263669) is the second most important feature.
zip_code_target_encoded: The target-encoded zip code (0.213280) is the third most important feature.
Random Forest Regressor:
avg_price_log1p_zipcode: The average price in the zip code (0.293686) is the most important feature.
living_measure_log1p: The logarithm of the living area size (0.273759) is the second most important feature.
zip_code_target_encoded: The target-encoded zip code (0.178785) is the third most important feature.
In summary: all three models lean most heavily on location-based features (avg_price_log1p_zipcode, zip_code_target_encoded) and living area size (living_measure_log1p); Gradient Boost and Random Forest share the same top 3, while XG Boost additionally singles out furnished_1.
# Define parameter grids for each model
param_grids = {
'Linear Regression': {
'fit_intercept': [True, False]
},
'Decision Tree Regression': {
'max_depth': [10, 20],
'min_samples_split': [15],
'min_samples_leaf': [4, 8]
},
'Ada Boost Regressor': {
'n_estimators': [70, 90],
'learning_rate': [0.1, 0.3]
},
'Ridge Regression': {
'alpha': [0.2, 0.4, 0.6]
},
'Lasso Regression': {
'alpha': [0.1, 0.2]
},
'XG Boost Regressor': {
'n_estimators': [90],
'learning_rate': [0.1, 0.3],
'max_depth': [10, 20],
'subsample': [0.9]
},
'Random Forest Regressor': {
'n_estimators': [130],
'max_depth': [10, 20],
'min_samples_split': [10, 15],
'min_samples_leaf': [4]
},
'Gradient Boost Regressor': {
'n_estimators': [90],
'learning_rate': [0.1, 0.3],
'max_depth': [10, 20],
'subsample': [0.9]
}
}
# Perform RandomizedSearchCV for each model
tuned_models = {}
for name, model in models.items():
    if name in param_grids:
        random_search = RandomizedSearchCV(
            estimator=model,
            param_distributions=param_grids[name],
            n_iter=15,  # Number of parameter settings that are sampled
            cv=5,  # Number of cross-validation folds
            scoring='neg_mean_squared_error',  # Use negative MSE for minimization
            n_jobs=-1,  # Use all available cores
            random_state=42  # For reproducibility
        )
        random_search.fit(X_train, y_train)
        tuned_models[name] = random_search.best_estimator_
        # Access and print the tuned parameters
        # print(model.get_params())
        print(f"Best Parameters for {name}:")
        display(pd.DataFrame(random_search.best_params_, index=[0]))
    else:
        tuned_models[name] = model  # Keep original model if no hyperparameter grid is defined.
        print(f"No hyperparameter grid defined for {name}. Using default parameters.")
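One detail worth knowing: when every entry in `param_distributions` is a list, `RandomizedSearchCV` samples parameter settings without replacement, and recent scikit-learn versions simply try the full grid when `n_iter` is at least the number of combinations (as happens for several of the small grids above). A minimal, self-contained version of the search on toy data:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=4, noise=5.0, random_state=42)

search = RandomizedSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_distributions={'max_depth': [2, 4, 8], 'min_samples_leaf': [1, 4]},
    n_iter=6,  # equals the 3 × 2 = 6 grid combinations, so the full grid is tried
    cv=3,
    scoring='neg_mean_squared_error',
    random_state=42,
)
search.fit(X, y)
best_params = search.best_params_
```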
Best Parameters for Linear Regression:
| fit_intercept | |
|---|---|
| 0 | False |
Best Parameters for Ridge Regression:
| alpha | |
|---|---|
| 0 | 0.6 |
Best Parameters for Lasso Regression:
| alpha | |
|---|---|
| 0 | 0.1 |
Best Parameters for Decision Tree Regression:
| min_samples_split | min_samples_leaf | max_depth | |
|---|---|---|---|
| 0 | 15 | 8 | 10 |
Best Parameters for Random Forest Regressor:
| n_estimators | min_samples_split | min_samples_leaf | max_depth | |
|---|---|---|---|---|
| 0 | 130 | 15 | 4 | 20 |
Best Parameters for Ada Boost Regressor:
| n_estimators | learning_rate | |
|---|---|---|
| 0 | 90 | 0.1 |
Best Parameters for Gradient Boost Regressor:
| subsample | n_estimators | max_depth | learning_rate | |
|---|---|---|---|---|
| 0 | 0.9 | 90 | 10 | 0.1 |
Best Parameters for XG Boost Regressor:
| subsample | n_estimators | max_depth | learning_rate | |
|---|---|---|---|---|
| 0 | 0.9 | 90 | 10 | 0.1 |
# Evaluate tuned models and create a DataFrame for comparison
tuned_results = {}
for name, model in tuned_models.items():
    y_pred = model.predict(X_test)
    y_train_pred = model.predict(X_train)
    test_rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    test_r2 = r2_score(y_test, y_pred)
    train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
    train_r2 = r2_score(y_train, y_train_pred)
    tuned_results[name] = {
        "Test RMSE": test_rmse,
        "Test R² Score": test_r2,
        "Train RMSE": train_rmse,
        "Train R² Score": train_r2
    }
tuned_results_df = pd.DataFrame(tuned_results).T.reset_index()
tuned_results_df.rename(columns={'index': 'Model'}, inplace=True)
display_dataframe_to_user("Tuned Model Performance Comparison (Train vs Test)", tuned_results_df)
--- Tuned Model Performance Comparison (Train vs Test) ---
| Model | Test RMSE | Test R² Score | Train RMSE | Train R² Score | |
|---|---|---|---|---|---|
| 0 | Linear Regression | 0.360576 | 0.868788 | 0.364812 | 0.867422 |
| 1 | Ridge Regression | 0.360573 | 0.868790 | 0.364814 | 0.867421 |
| 2 | Lasso Regression | 0.467127 | 0.779783 | 0.475353 | 0.774906 |
| 3 | Decision Tree Regression | 0.409130 | 0.831071 | 0.342312 | 0.883272 |
| 4 | Random Forest Regressor | 0.357318 | 0.871148 | 0.255002 | 0.935224 |
| 5 | Ada Boost Regressor | 0.465870 | 0.780967 | 0.466321 | 0.783379 |
| 6 | Gradient Boost Regressor | 0.343287 | 0.881069 | 0.127303 | 0.983856 |
| 7 | XG Boost Regressor | 0.340234 | 0.883175 | 0.145966 | 0.978776 |
The table provides performance metrics for the various tuned regression models. The three best-performing tuned models based on the lowest Test RMSE and highest Test R² Score are:
XG Boost Regressor:
Gradient Boost Regressor:
Random Forest Regressor:
These three tuned models strike the best balance between low error (RMSE) and high explanatory power (R² Score) on the test data.
xgb_t_model = tuned_models['XG Boost Regressor']
gb_t_model = tuned_models['Gradient Boost Regressor']
rf_t_model = tuned_models['Random Forest Regressor']
# Fit the XG Boost model if it hasn't been fitted yet
if not hasattr(xgb_t_model, 'feature_importances_'):
    xgb_t_model.fit(X_train, y_train)
# Fit the Gradient Boosting model if it hasn't been fitted yet
if not hasattr(gb_t_model, 'feature_importances_'):
    gb_t_model.fit(X_train, y_train)
# Fit the Random Forest model if it hasn't been fitted yet
if not hasattr(rf_t_model, 'feature_importances_'):
    rf_t_model.fit(X_train, y_train)
xgb_t_top_features = plot_feature_importances(xgb_t_model, 'XG Boost Regressor')
gb_t_top_features = plot_feature_importances(gb_t_model, 'Gradient Boost Regressor')
rf_t_top_features = plot_feature_importances(rf_t_model, 'Random Forest Regressor')
print("Tuned XG Boost Regressor Top 3 Features:")
print(xgb_t_top_features)
print("\nTuned Gradient Boost Regressor Top 3 Features:")
print(gb_t_top_features)
print("\nTuned Random Forest Regressor Top 3 Features:")
print(rf_t_top_features)
Tuned XG Boost Regressor Top 3 Features:
['furnished_1', 'avg_price_log1p_zipcode', 'living_measure_log1p']

Tuned Gradient Boost Regressor Top 3 Features:
['avg_price_log1p_zipcode', 'living_measure_log1p', 'zip_code_target_encoded']

Tuned Random Forest Regressor Top 3 Features:
['living_measure_log1p', 'avg_price_log1p_zipcode', 'zip_code_target_encoded']
Comparison of tuned and non-tuned top 3 best performing models
XG Boost Regressor: The non-tuned model has a slightly better Test RMSE (0.335751 vs. 0.340234) and Test R² Score (0.886233 vs. 0.883175) compared to the tuned model.
Gradient Boost Regressor: The tuned model performs better, with a lower Test RMSE (0.343287 vs. 0.354289) and higher Test R² Score (0.881069 vs. 0.873324).
Random Forest Regressor: The non-tuned model has a marginally better Test RMSE (0.355431 vs. 0.357318) and Test R² Score (0.872506 vs. 0.871148).
Overall, the non-tuned XG Boost Regressor performs slightly better than its tuned counterpart, while the tuned Gradient Boost Regressor and Random Forest Regressor show improvements over their non-tuned versions.
This comparison highlights the importance of tuning models to achieve optimal performance, though in some cases, non-tuned models can still perform exceptionally well.
Comparison of the other tuned and non-tuned models
The tuned Lasso Regression model performs significantly better than the non-tuned version on both the test data (Test RMSE 0.467127 vs. 0.995529; Test R² Score 0.779783 vs. -0.000201) and the training data (Train RMSE 0.475353 vs. 1.001923; Train R² Score 0.774906 vs. 0.000000). This highlights the importance of tuning hyperparameters to improve model performance.
The tuned Decision Tree Regression model outperforms the non-tuned version, with a lower Test RMSE (0.409130 vs. 0.501102) and a higher Test R² Score (0.831071 vs. 0.746585). This indicates that the tuned model has better accuracy and generalizes more effectively to the test data, avoiding the overfitting seen in the non-tuned model. Overall, tuning significantly improves this model's performance.
Overall Insights
XG Boost Regressor: The non-tuned model performs slightly better than the tuned version, indicating that tuning may not always be necessary for optimal performance.
Gradient Boost Regressor: The tuned model shows improved performance, highlighting the benefits of tuning for this model.
Random Forest Regressor: The non-tuned model performs marginally better, suggesting that the default settings are already quite effective.
Lasso Regression: The tuned model significantly outperforms the non-tuned version, demonstrating the importance of tuning for this model.
Decision Tree Regression: The tuned model performs better, indicating that tuning helps avoid overfitting and improves generalization. Conclusion: Tuning generally enhances model performance, especially for models like Gradient Boost, Lasso, and Decision Tree Regression. However, some models like XG Boost and Random Forest can still perform well without tuning, showing that the necessity of tuning can vary by model.
# Copy df into a new dataframe to be used for feature engineering
df_fe = df.copy()
df_fe.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 26 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   year_month                21613 non-null  object
 1   room_bed                  21613 non-null  category
 2   room_bath                 21613 non-null  category
 3   ceil                      21613 non-null  category
 4   coast                     21613 non-null  category
 5   sight                     21613 non-null  category
 6   condition                 21613 non-null  category
 7   quality                   21613 non-null  category
 8   yr_renovated              21613 non-null  int64
 9   lat                       21613 non-null  int64
 10  long                      21613 non-null  int64
 11  furnished                 21613 non-null  category
 12  sold_year                 21613 non-null  int64
 13  house_age                 21613 non-null  int64
 14  price_log1p               21613 non-null  float64
 15  living_measure_log1p      21613 non-null  float64
 16  living_measure15_log1p    21613 non-null  float64
 17  lot_measure15_log1p       21613 non-null  float64
 18  total_area_log1p          21613 non-null  float64
 19  ceil_measure_log1p        21613 non-null  float64
 20  basement_log1p            21613 non-null  float64
 21  avg_price_log1p_zipcode   21613 non-null  float64
 22  cluster                   21613 non-null  int32
 23  total_area_log1p_segment  21613 non-null  category
 24  house_age_segment         21613 non-null  category
 25  zip_code_target_encoded   21613 non-null  float64
dtypes: category(10), float64(9), int32(1), int64(5), object(1)
memory usage: 2.8+ MB
Creating aggregated features
# Count of premium features (e.g., cluster, furnished, sight)
# Why: Higher counts of premium features indicate more luxurious properties.
# Create the 'premium_features_count' feature
df_fe['premium_features_count'] = 0 # Initialize the feature to 0
# Define the conditions for premium features
conditions = [
(df_fe['cluster'] == 1),
(df_fe['furnished'] == 1),
(df_fe['sight'] == 2),
]
# Count the number of times each condition is met for each row
for condition in conditions:
    df_fe['premium_features_count'] += condition.astype(int)
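The flag-counting loop above can be verified on a toy frame (illustrative values only, not dataset rows):

```python
import pandas as pd

toy = pd.DataFrame({'cluster': [1, 0, 1], 'furnished': [1, 1, 0], 'sight': [2, 0, 2]})
toy['premium_features_count'] = 0
for cond in [(toy['cluster'] == 1), (toy['furnished'] == 1), (toy['sight'] == 2)]:
    toy['premium_features_count'] += cond.astype(int)  # add 1 per premium feature present
counts = toy['premium_features_count'].tolist()
```

The first row satisfies all three conditions, the second only the furnished check, and the third two of the three, so the counts come out as 3, 1, and 2.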
Creating composite features
# Weighted score for premium features based on perceived importance
# Why: Weighted scores prioritize more premium features
df_fe['premium_features_score'] = (
(df_fe['cluster'] == 1).astype(int) * 3 +
(df_fe['furnished'] == 1).astype(int) * 2 +
(df_fe['sight'] == 2).astype(int) * 2.5
)
# Weighted score for luxury features based on perceived importance
# Why: Weighted scores prioritize more luxury features
df_fe['luxury_features_score'] = (
(df_fe['cluster'] == 2).astype(int) * 3 +
(df_fe['furnished'] == 1).astype(int) * 2 +
(df_fe['sight'] == 3).astype(int) * 2.5
)
Creating interaction flags
# Interaction flags are binary indicators that represent the presence of specific feature combinations.
# These flags help the model quickly identify key interactions.
# Flag for Premium Houses combination
# Why: Houses combining all three of the following features might appeal to high-income buyers looking for premium houses.
df_fe['premium_flag'] = (
(df_fe['cluster'] == 1) & (df_fe['furnished'] == 1) & (df_fe['sight'] == 2)
).astype(int)
# Flag for Luxurious Houses combination
# Why: Houses combining all three of the following features might appeal to high-income buyers looking for luxury houses.
df_fe['luxury_flag'] = (
(df_fe['cluster'] == 2) & (df_fe['furnished'] == 1) & (df_fe['sight'] == 3)
).astype(int)
# Extract numerical_cols and categorical_cols from df and print them
numerical_cols = df_fe.select_dtypes(include=np.number).columns.tolist()
categorical_cols = df_fe.select_dtypes(exclude=np.number).columns.tolist()
# Separate numerical and categorical features
df_fe_numerical = df_fe[numerical_cols]
df_fe_categorical = df_fe[categorical_cols]
# Scale all numerical features to avoid scale bias
scaler = StandardScaler()
df_numerical_fe_scaled = pd.DataFrame(
scaler.fit_transform(df_fe_numerical),
columns=df_fe_numerical.columns # Use df_fe_numerical.columns to get a 1D Index
)
# One-hot encode categorical features
df_categorical_fe_encoded = pd.get_dummies(df_fe_categorical, columns=categorical_cols, dtype=int, drop_first=True)
# Concatenate the scaled numerical and encoded categorical features
df_fe_processed = pd.concat([df_numerical_fe_scaled, df_categorical_fe_encoded], axis=1)
X_fe = df_fe_processed.drop(columns=['price_log1p']) # Removing target variable 'price_log1p'
y_fe = df_fe_processed['price_log1p']
# Splitting data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_fe, y_fe, test_size=0.3, random_state=42)
# Train and evaluate models on the feature-engineered data
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)  # Mean Squared Error
    rmse = np.sqrt(mse)  # Root Mean Squared Error
    r2 = r2_score(y_test, y_pred)  # R-squared score
    results_fe[name] = {"RMSE": rmse, "R² Score": r2}
# Convert results_fe to DataFrame and display
results_df_fe = pd.DataFrame(results_fe).T
# Create a DataFrame for original model performance
original_model_fe_performance = pd.DataFrame(results_fe).T.reset_index()
original_model_fe_performance.rename(columns={'index': 'Model'}, inplace=True)
# Add train performance metrics
train_fe_results = {}
for name, model in models.items():
    y_train_pred = model.predict(X_train)
    train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
    train_r2 = r2_score(y_train, y_train_pred)
    train_fe_results[name] = {"Train RMSE": train_rmse, "Train R² Score": train_r2}
train_fe_results_df = pd.DataFrame(train_fe_results).T.reset_index()
train_fe_results_df.rename(columns={'index': 'Model'}, inplace=True)
# Merge with the original model performance DataFrame
original_model_fe_performance = pd.merge(original_model_fe_performance, train_fe_results_df, on='Model')
# Rename the original model columns
original_model_fe_performance.rename(columns={'RMSE': 'Test RMSE', 'R² Score': 'Test R² Score'}, inplace=True)
display_dataframe_to_user("Original Model Performance Comparison on the feature engineered dataframe (Train vs Test)", original_model_fe_performance)
--- Original Model Performance Comparison on the feature engineered dataframe (Train vs Test) ---
| Model | Test RMSE | Test R² Score | Train RMSE | Train R² Score | |
|---|---|---|---|---|---|
| 0 | Linear Regression | 0.360474 | 0.868862 | 0.364809 | 0.867425 |
| 1 | Ridge Regression | 0.360466 | 0.868868 | 0.364812 | 0.867422 |
| 2 | Lasso Regression | 0.995529 | -0.000201 | 1.001923 | 0.000000 |
| 3 | Decision Tree Regression | 0.496635 | 0.751083 | 0.000345 | 1.000000 |
| 4 | Random Forest Regressor | 0.352488 | 0.874608 | 0.133358 | 0.982284 |
| 5 | Ada Boost Regressor | 0.476469 | 0.770887 | 0.474534 | 0.775681 |
| 6 | Gradient Boost Regressor | 0.357141 | 0.871276 | 0.348560 | 0.878972 |
| 7 | XG Boost Regressor | 0.339101 | 0.883952 | 0.217732 | 0.952775 |
Impact of Feature Engineering and Hyperparameter Tuning on Regression Model Performance
Top 3 Performing Models
Here's a summary of the top 3 performing models from the feature engineered data frame, compared against the same models from the data frame without tuning or feature engineering, and the hyperparameter tuned results:
Top 3 Models with Feature Engineering:
Top 3 Models without Feature Engineering:
Top 3 Models with Hyperparameter Tuning:
Insights:
Other Models Comparison:
Let's take a closer look at the performance of the other models across the three scenarios: feature engineered, no tuning/feature engineering, and hyperparameter tuned.
Linear Regression:
Ridge Regression:
Lasso Regression:
Decision Tree Regression:
Ada Boost Regressor:
Insights:
Linear and Ridge Regression: Both models show very consistent performance across all scenarios, indicating that they are not significantly impacted by feature engineering or hyperparameter tuning.
Lasso Regression: Shows poor performance without tuning, but hyperparameter tuning significantly improves its performance.
Decision Tree Regression: Benefits from hyperparameter tuning, showing a notable improvement in Test R² Score.
Ada Boost Regressor: Shows moderate performance across all scenarios, with slight improvements from hyperparameter tuning.
Overall, hyperparameter tuning generally improves model performance, especially for models like Lasso Regression and Decision Tree Regression. Feature engineering has a more varied impact, with some models benefiting more than others.
xgb_fe_model = models['XG Boost Regressor']
gb_fe_model = models['Gradient Boost Regressor']
rf_fe_model = models['Random Forest Regressor']
# Fit the XG Boost model if it hasn't been fitted yet
if not hasattr(xgb_fe_model, 'feature_importances_'):
    xgb_fe_model.fit(X_train, y_train)
# Fit the Gradient Boosting model if it hasn't been fitted yet
if not hasattr(gb_fe_model, 'feature_importances_'):
    gb_fe_model.fit(X_train, y_train)
# Fit the Random Forest model if it hasn't been fitted yet
if not hasattr(rf_fe_model, 'feature_importances_'):
    rf_fe_model.fit(X_train, y_train)
xgb_fe_top_features = plot_feature_importances(xgb_fe_model, 'XG Boost Regressor')
gb_fe_top_features = plot_feature_importances(gb_fe_model, 'Gradient Boost Regressor')
rf_fe_top_features = plot_feature_importances(rf_fe_model, 'Random Forest Regressor')
print("Feature Engineered XG Boost Regressor Top 3 Features:")
print(xgb_fe_top_features)
print("\nFeature Engineered Gradient Boost Regressor Top 3 Features:")
print(gb_fe_top_features)
print("\nFeature Engineered Random Forest Regressor Top 3 Features:")
print(rf_fe_top_features)
Feature Engineered XG Boost Regressor Top 3 Features:
['luxury_features_score', 'furnished_1', 'avg_price_log1p_zipcode']

Feature Engineered Gradient Boost Regressor Top 3 Features:
['luxury_features_score', 'living_measure_log1p', 'zip_code_target_encoded']

Feature Engineered Random Forest Regressor Top 3 Features:
['luxury_features_score', 'living_measure_log1p', 'avg_price_log1p_zipcode']
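The helper plot_feature_importances used above is defined earlier in the notebook; a minimal sketch of what such a helper might look like is shown below. The signature and top_n default are assumptions, not the notebook's actual definition.

```python
# Hedged sketch of a plot_feature_importances-style helper: plots a bar chart
# of a fitted model's feature_importances_ and returns the top-N feature names.
# The exact signature in the notebook may differ.
import numpy as np
import matplotlib.pyplot as plt

def plot_feature_importances(model, title, feature_names=None, top_n=3):
    importances = model.feature_importances_
    # Fall back to generic names if none are supplied (an assumption here)
    if feature_names is None:
        feature_names = [f"f{i}" for i in range(len(importances))]
    order = np.argsort(importances)[::-1]
    plt.figure(figsize=(8, 4))
    plt.bar([feature_names[i] for i in order], importances[order])
    plt.title(f"Feature Importances: {title}")
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()
    return [feature_names[i] for i in order[:top_n]]
```

Any model exposing feature_importances_ (XGBoost, Gradient Boosting, Random Forest) can be passed in, which is why the same helper serves all three calls above.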
Feature Importance Comparative Analysis of Top Models Across Different Scenarios: Feature Engineering, Hyperparameter Tuning, and Baseline
Feature Importance Consistency:
Luxury Features Score: This feature is consistently important for models with feature engineering, indicating its strong predictive power when engineered features are included.
Avg Price Log1p Zipcode: This feature appears in the top 3 for most models and scenarios (e.g., both XG Boost and Random Forest on the feature-engineered data), highlighting its overall significance in predicting the target variable.
Impact of Feature Engineering:
Hyperparameter Tuning:
Model-Specific Insights:
XG Boost Regressor: The feature furnished_1 is important in both the engineered and tuned scenarios, suggesting that this model benefits from categorical features.
Gradient Boost Regressor: The feature zip_code_target_encoded is consistently important, showing that this model leverages target encoding effectively.
Random Forest Regressor: The feature living_measure_log1p is consistently important, indicating that this model relies heavily on this feature for predictions.
These insights can help in understanding how different preprocessing and tuning strategies impact model performance and feature importance.
Final Business Insights & Recommendations
XGBoost is the best model for house price prediction due to its robust performance across all scenarios.
Feature engineering added meaningful insights, with luxury features playing a crucial role in price determination.
Hyperparameter tuning benefits some models (Gradient Boosting, Decision Tree) but is not always necessary for the best models.
Future improvements: